Reducing Achievement Gaps in Academic Writing for Latinos and English Learners in Grades 7-12


Carol  Booth  Olson,  Tina  Matuchniak,  Huy  Q.  Chung,  Rachel  Stumpf,  and  George  Farkas 

University  of  California,  Irvine 


This study reports 2 years of findings from a randomized controlled trial designed to replicate and
demonstrate the efficacy of an existing, successful professional development program, the Pathway
Project, that uses a cognitive strategies approach to text-based analytical writing. Building on an earlier
randomized field trial in a large, urban, low socioeconomic status (SES) district in which 98% of the
students were Latino and 88% were mainstreamed English learners (ELs) at the intermediate level of
fluency, the project aimed to help secondary school students, specifically Latinos and mainstreamed ELs,
in another large, urban, low-SES district to develop the academic writing skills called for in the rigorous
Common Core State Standards for English Language Arts. The Pathway Project draws on well-documented
instructional frameworks supporting approaches that incorporate strategy instruction to
enhance students’ academic literacy. Ninety-five teachers in 16 secondary schools were stratified by
school and grade and then randomly assigned to the Pathway or control group. Pathway teachers
participated in 46 hr of training to help students write analytical essays. Difference-in-differences and
regression analyses revealed significant effects on student writing outcomes in both years of the
intervention (Year 1, d = 0.48; Year 2, d = 0.60). Additionally, Pathway students had higher odds than
control students of passing the California High School Exit Exam in both years.

Keywords: Latino students, English learners, teacher professional development, impact studies, writing
assessments


In their vision of what it means to be literate in the 21st century,
the Common Core State Standards for English Language Arts
(CCSS-ELA) prioritize the ability to analyze and interpret challenging
texts using academic discourse in extended pieces of
writing. In addition to specifying standards for each grade
level, the CCSS-ELA present College and Career Readiness Anchor
Standards for Reading and Writing in Grades K to 5 and
Grades 6 to 12 that define the skills and understandings all students
must demonstrate. These include the ability to “read closely to
determine what the text says explicitly and to make logical inferences
from it” (p. 10) and to “write arguments to support claims
in an analysis of substantive topics or texts, using valid reasoning
and relevant and sufficient evidence” (National Governors Association
Center for Best Practices & Council of Chief State School
Officers, 2010, p. 18). As is evident from these anchor standards,
the CCSS-ELA set a high bar for all students and place a premium
on the ability to analyze and interpret challenging texts and to
write about those texts using academic discourse in extended
pieces of writing. However, results from the most recent administration
(2011) of the National Assessment of Educational Progress
(NAEP) in writing (U.S. Department of Education, Institute of
Educational Sciences, National Center for Education Statistics,
2012a) indicate that today’s secondary students face considerable
challenges in meeting these standards. For example, only 27% of
all 12th graders and 11% of Hispanic¹ students scored at the level
of proficient or above in writing. Most alarming is that only 1% of
English learners (ELs) scored at the level of proficient or above.
Given that by 2020, one in four children enrolled in America’s
K-12 public schools will be Latino (Maxwell, 2012), and that ELs
in Grades 7 to 12 are the fastest-growing segment of the K-12
student population (Francis, Rivera, Lesaux, Keiffer, & Rivera,
2006), these disparities are worrisome. Because writing is a gatekeeper
for college admission and a “threshold skill” for hiring and
promotion for salaried workers (National Commission on Writing
for America’s Families, Schools, and Colleges, 2004), failure to
close these achievement gaps in academic writing will have serious
social and economic consequences.

Who  Are  ELs? 

The nation lacks a uniform definition and classification of ELs.
Hence, ELs are called many names, including “English language
learners,” “limited English proficient students,” “language or linguistic
minority students,” and “second language learners.” Forty
years ago, many viewed ELs as a relatively homogeneous group
with similar instructional needs. This stereotype has endured despite
large demographic changes. However, ELs today are a diverse
group with unique experiences and backgrounds (Harklau,
Losey, & Siegal, 1999; Matsuda, Ortmeir-Hooper, & You, 2006).
Currently, ELs constitute approximately 10% of the total K-12
population (U.S. Department of Education, Institute of Educational
Sciences, National Center for Education Statistics, 2015). Although
ELs in the United States speak more than 350 languages,
73% speak Spanish as their first language (Batalova & McHugh,
2010), 40% have origins in Mexico (Hernandez, Denton, & Macartney,
2008), and 60% of ELs in Grades 6 through 12 come from
low-income families (Batalova, Fix, & Murray, 2007; Capps et al.,
2005). At the same time that EL enrollments have increased in
U.S. public schools, researchers and policymakers have highlighted
large literacy gaps based on students’ English language
proficiency.

1 We use the term Hispanic whenever the data are referenced as such in
the primary source (e.g., NAEP data use the term Hispanic). All other
times we use the term Latino.

The largest group of ELs in our schools today is referred to
as long-term ELs (LTELs; Menken & Kleyn, 2009). According to
Olsen (2010), these are students who have been educated in the
United States since age 6, are doing poorly in school, and have
major gaps in knowledge because their schooling was disrupted. In
Olsen’s study of 175,734 ELs, the majority (59%) were LTELs
who were failing to acquire academic language and struggling to
do well in high school. They may come from homes in which the
primary language is not English, but they themselves may speak
only English or they may switch between multiple languages and
still have features in their writing attesting to their multilingual
status (Valdes, 2001). Limited in their knowledge of academic
registers in any language, these students are often mainstreamed
into regular English language arts classrooms. Given the many
demands that academic writing places on all students and the few
opportunities to practice, ELs need explicit instruction and ongoing
support as they strive to become college and career ready. As
Goldenberg (2013) reminds us,

It  should  be  clear  that  despite  progress  in  understanding  how  to 
improve  teaching  and  learning  for  the  millions  of  ELs  in  our  schools, 
many  gaps  remain.  The  challenges  posed  by  the  Common  Core  State 
Standards make those gaps more glaring. (p. 10)

Overview  of  the  Pathway  Project  Professional 
Development  Program 

The treatment is an intensive 46-hr professional development
program in which secondary teachers learn how to integrate cognitive
strategy instruction into process writing to improve the interpretive
reading and text-based analytical writing of students, specifically
Latinos and mainstreamed ELs, by (a) using a cognitive
strategies approach to reading and writing instruction, (b) instructing
students to revise a pretest on-demand writing assessment into
multiple draft essays, and (c) receiving ongoing support from
experienced Pathway Project teachers who serve as coaches to
teachers in the experimental condition. Results from earlier studies
(Kim et al., 2011; Olson et al., 2012; Olson & Land, 2007; Olson,
Land, Anselmi, & AuBuchon, 2010) suggest that integrating strategy
instruction within a text-based approach to analytical writing
can enhance Latinos’ and ELs’ writing ability.


The efficacy of the Pathway Project owes much to the teachers-teaching-teachers
model of the National Writing Project, with its
inherent respect for the capacity of practitioners to generate and
use knowledge to inform and improve their practice. Ongoing and
sustained professional development based on the analysis of student
work contributes to the establishment of a professional learning
community dedicated to the academic progress of students.
This collaboration creates a network of language arts classes in
which highly trained teachers prepare all students, but specifically
Latinos and ELs, to develop the interpretive reading and analytical
writing abilities necessary for academic success.

Why Take a Cognitive Strategies Approach to
Teaching Text-Based Analytical Writing to Latinos
and Mainstreamed ELs?

The Pathway Project professional development takes a cognitive
strategies approach to reducing the achievement gap between
Latinos, ELs, and their native-English-speaking peers in the area
of text-based academic writing. Numerous reports from policy
centers and blue-ribbon panels “implicate poor understandings of
cognitive strategies as the primary reason why adolescents struggle
with reading and writing” (Conley, 2008, p. 84; see also Graham,
2006; Snow & Biancarosa, 2003). Further, research conducted
over the past 15 years on the content of college courses and
instructor expectations indicates that cognitive strategy use is the
key to college and career readiness (Conley, 2013). The cognitive
strategies intervention that is the focus of this study is grounded in
a wide body of research on what experienced readers and writers
do when they construct meaning from and with texts. Many
studies demonstrate the efficacy of cognitive strategy use in reading
(Block & Pressley, 2002; Duke & Pearson, 2008; National
Institute of Child Health & Human Development, 2000; Paris,
Wasik, & Turner, 1991; Tierney & Pearson, 1983; Tierney &
Shanahan, 1991). Similarly, Graham and Perin (2007) indicate that
strategy instruction is the most effective of 11 key elements of
writing instruction (d = .82) for all students, and particularly for
students who find writing challenging.

Increasingly, recent instructional frameworks and recommendations
also support approaches that incorporate strategy instruction
to advance ELs’ development of English (Calderon,
Slavin, & Sanchez, 2011; Francis et al., 2006; Goldenberg, 2008;
Schleppegrell, 2009). Short and Fitzsimmons (2007) hypothesize
that strategy instruction is especially effective for ELs because it
provides them with an explicit focus on language, increases their
exposure to academic texts, makes the texts they read comprehensible,
gives them multiple opportunities to affirm or correct their
understanding and use of language, assists them in retrieving new
language features and in using these features for academic purposes,
and provides them with the means of learning language on
their own, outside of class. They further hypothesize that adolescent
ELs of an intermediate level of English proficiency, who
represent the majority of LTELs in California (Olsen, 2010), have
sufficient proficiency to benefit from strategy instruction (Echevarria,
Short, & Vogt, 2012; Short & Fitzsimmons, 2007) because
they possess the language proficiency required to use the types of
cognitive strategies that will provide them access to the higher
order cognitive reading and writing tasks encountered in regular
content instruction. Explicitly teaching strategic reading and writing
behaviors to ELs can help them engage with complex texts and
convey their interpretations in well-reasoned essays to meet the
CCSS-ELA (August & Shanahan, 2006; Bunch, Kibler, & Pimentel,
2012; Francis et al., 2006; Goldenberg, 2008).

Although research-based practices for developing cognitive
strategies are recommended as the “pathway for literacy reform in
middle and high schools” (Conley, 2008, p. 85), very little of this
type of instruction occurs in school, especially for Latinos in low
socioeconomic status (SES) schools and for ELs (Block & Pressley,
2002; Conley, 2013; Graham & Perin, 2007; Kong & Pearson,
2003; Vaughn & Klinger, 2004). Indeed, researchers have noted a
“growing inequality” in classroom instruction, in which students
designated as “honors students” are exposed to rigorous academic
work designed to promote higher literacy, whereas lower achievers,
children of the poor, and second-language learners often
receive instruction that places a premium on the “transmission of
information, providing very little room for the exploration of ideas,
which is necessary for the development of deeper understanding”
(Applebee, Langer, Nystrand, & Gamoran, 2003, p. 689). According
to a Carnegie Corporation report, inadequate educator capacity
and the limited use of research-based instructional practices prevent
ELs from learning academic English at a level necessary to
meet content standards in English language arts (Short & Fitzsimmons,
2007). This negatively impacts their ability to participate
fully in educational programs, successfully complete coursework,
and achieve the academic outcomes of which they are capable.

Theoretical  Framework  and  Theory  of  Change 

The Pathway Project is informed by cognitive, sociocognitive,
and sociocultural theory. In their cognitive process theory of
writing, Flower and Hayes (1981) posit that writing is best understood
“as a set of distinct thinking processes which writers orchestrate
and organize during the act of composing” (p. 275), including
planning, organizing, goal setting, translating, monitoring, reviewing,
evaluating, and revising. They liken these processes to a
“writer’s tool kit” (p. 285), which is not constrained by any fixed
order or series of stages. Similarly, Tierney and Pearson (1983)
propose that “reading and writing are essentially similar processes
of meaning construction” (p. 568), and that readers compose
meaning, creating drafts of their understanding, just like writers
do. The characteristics they identify as being essential to the
meaning construction of readers are planning, drafting, aligning,
revising, and monitoring. Concurring with Flower and Hayes, they
note that this cognitive process involves continuous, recurring, and
recursive transactions between readers and writers. Shanahan
(2016) notes that this shared cognition model conceptualizes reading
and writing as being built upon a similar foundation, and that
both readers and writers utilize procedural knowledge about how
to access, use, and generate information intentionally and enact
cognitive strategies such as predicting, questioning, and summarizing
in the act of meaning construction.

However,  writers  and  readers  do  not  construct  meaning  in  a 
vacuum.  Langer  (1991)  asserts  that  literacy  comprises  more  than 
the  individual’s  ability  to  read  and  write.  Drawing  on  the  work  of 
Vygotsky  (1986),  she  suggests  that  literacy  is  the  ability  to  think 
and  reason  like  a  literate  person  within  a  particular  society.  In 
other  words,  literacy  is  culture-specific  and  meaning  is  socially 
constructed.  She  writes, 


Within  social  settings,  both  at  home  and  in  school  students  learn  how 
literacy  is  used  and  how  literate  knowledge  is  communicated — what 
counts  as  a  literacy  event  and  what  literacy  behaviors  “look  like,” 
what  literacy  related  values  are  respected,  and  what  literacy  habits  are 
to  be  cultivated.  As  children  learn  to  engage  in  literate  behaviors  to 
serve the functions and reach the ends they see modeled around them,
they become literate in a culturally specific way; they use certain
cognitive strategies to structure their thoughts and complete their
tasks, and not others. (p. 17)

From  a  sociocognitive  perspective,  teachers  should  pay  more 
attention  to  the  social  purposes  to  which  literacy  skills  are  applied, 
and  should  go  beyond  delivering  lessons  on  content  to  impart 
strategies  for  thinking  necessary  to  complete  literacy  tasks,  first 
with  guidance  and,  ultimately,  independently. 

Finally, sociocultural theory views meaning as being “negotiated
at the intersection of individuals, culture, and activity” (Englert,
Mariage, & Dunsmore, 2006, p. 208). Three tenets of sociocultural
theory are applicable to the intervention (adapted from
Englert, Mariage, & Dunsmore, 2006):

1. Cognitive apprenticeships: Sociocultural theory promotes
the power of cognitive apprenticeships in which
novices learn literate behaviors through the repeated
modeling of more mature, experienced adults or peers
who make the tacit knowledge of meaning construction
visible through explicit instruction, as well as think-alouds
to provide access to strategies and tools demonstrated
by successful readers and writers (Vygotsky,
1986). Jerome Bruner (1978) used the term scaffolding to
describe this “tutorial assistance” provided to the apprentice
by the veteran.

2. Procedural facilitators and tools: A second tenet of sociocultural
theory is that teachers are most effective when
they lead cognitive development in advance of what
students can accomplish alone by presenting challenging
material along with procedural and facilitative tools to
help readers and writers address those cognitive challenges.
These mental, linguistic, and physical tools help
bridge the gap between what a student can perform alone
and what Vygotsky (1986) refers to as the zone of proximal
development, and can include notational systems,
think sheets, graphic organizers, prompts, planning strategies,
mnemonics, text structures, cue cards, and more.

3. Community of practice: Third, and perhaps most importantly,
sociocultural theory values the establishment of
communities of practice in which teachers actively encourage
students to collaborate and provide ongoing opportunities
and thoughtful activities that invite students to
engage in shared inquiry. Tompkins (2010) compares this
classroom community approach with the difference between
owning and renting a home:

In a classroom community, students and teachers are joint “owners” of
the classroom. Students assume responsibility for their own learning
and behavior, work collaboratively with classmates, complete assignments,
and care for the classroom. In traditional classrooms, in contrast,
the classroom is the teacher’s and the students are simply renters
for the school year. (p. 16)

In the Pathway Project, all three theoretical frames (cognitive,
sociocognitive, and sociocultural) apply to working with teachers
as well as to working with students, as teachers are introduced to
the cognitive strategies that underlie the reading and writing process,
provided with a variety of tools to make strategy use “visible”
to students, and engage in professional learning community school
teams to implement what they learn during professional development
in their classrooms.

The theory of change underlying the Pathway Project intervention
is that providing teachers with sustained, high-quality professional
development focused on helping students improve their
academic writing would prompt them to implement cognitive
strategies instruction in their classrooms (proximal outcome), resulting
in enhanced student performance on a Pathway-developed
pre-post timed writing assessment (intermediate outcome), transferring
to improvement on summative assessments of state content
standards (distal outcome; see Figure 1).

This logic model was developed based on findings from an
earlier 4-year randomized controlled trial conducted in the Santa
Ana Unified School District (SAUSD), a large, urban, low-SES
school district with 98% Latino students, 88% mainstreamed ELs
at the intermediate level of fluency or above, and only 7%
English-only (EO) speakers. This prior study yielded effect sizes
of .35 in Year 1 and .67 in Year 2 on an on-demand assessment of
academic writing. It also showed that the Pathway Project had a
small but statistically significant impact on the writing subtest of
the California Standards Test (CST; Kim et al., 2011; Olson et al.,
2012). However, because 98% of the students in the SAUSD were
Latino and only 7% were classified as EO, it was not possible to
compare the progress of Latinos or ELs with other subgroups. Like
the SAUSD, the Anaheim Union High School District
(AUHSD), the site of the current study, is a large, low-SES,
urban district serving over 33,000 students, but the student population
is more diverse (3% African American, 16% Asian, 64%
Hispanic, 12% White, 1% Pacific Islander, 3% Other; 23% EO),
allowing for comparisons among race/ethnicity and language
subgroups, which was not possible in the prior (SAUSD) study.

Our  present  study  is  designed  to  answer  the  following  research 
questions: 

1. To what extent will teachers’ participation in the Pathway
Project professional development program improve academic
outcomes for secondary students in Grades 7 to 12
on an on-demand writing assessment and a state-mandated
high school exit exam in English language
arts?

2. To what extent does the intervention have a differential
effect on academic outcomes for targeted race/ethnicity
and language proficiency subgroups on an on-demand
writing assessment and a state-mandated high school exit
exam in English language arts?

Given the considerable cognitive burden that text-based academic
writing places on students in general, and ELs in particular,
we hypothesized that taking a cognitive strategies approach to
literacy instruction would result in a positive impact on student
writing outcomes.

Figure 1. Theory of change model for the Pathway Project. Source: Olson et al. (2012). Reprinted with permission.

Method 

Study  Design 

This  study  utilized  a  multisite,  cluster,  randomized  field  trial  in 
which  secondary  schools  were  the  sites,  teachers  were  randomly 
assigned  to  the  Pathway  Project  (hereafter,  “Pathway”  condition  or 
group)  or  control  condition  in  the  intervention,  and  one  of  each 
teacher’s  classes  was  randomly  selected  to  participate  in  the  study. 
Students  were  randomly  assigned  to  each  teacher’s  classes. 

Participants 

In the 2012-2013 school year, the AUHSD enrolled 33,000
students in Grades 7 to 12, 60% of whom were Latino, 66% of
whom were ELs or reclassified fluent English proficient (RFEP) students,
and 66% of whom were eligible for free or reduced-price lunch
(FRPL). The AUHSD classifies its ELs based on an initial language
placement exam using the California English Language
Development Test (CELDT). An RFEP student is one who
was initially identified as needing EL support and subsequently
scored at the 50th percentile or below on the CELDT, but who,
through a combination of proficient scores on the CELDT (scoring
at the 75th percentile or above), strong academic performance in
courses, proficient scores on other standardized tests, and parent
recommendations, has been placed into mainstreamed English
classrooms.

Teacher participants. In the summer of 2012, 100 teachers
from 16 secondary schools in the district (eight middle and eight
high schools) were recruited to participate in the study. Of these,
95 consented to participate in the study and were randomly assigned,
separately for each grade within each school, to the Pathway
or control condition. The study sample consisted of 49 Pathway
teachers and 46 control teachers. To avoid crossover effects,
Pathway teachers were instructed not to share program materials
with control teachers. Observation of control classrooms showed
no evidence that any sharing had occurred. On average, teachers in
the study had 14.82 years of total teaching experience and 77%
had earned a master’s degree. Although teachers received
their baccalaureate degrees from 21 different undergraduate institutions,
a majority (56%) graduated from a California State University.
There was no statistically significant difference between
Pathway teachers and control teachers in the total years of teaching
experience (p = .45), the percentage who graduated with a bachelor’s
degree from a California State University (p = .79), or the
percentage who earned a master’s degree (p = .15). These findings
indicate that Pathway and control group teachers were similar
on observed teacher characteristics measured at baseline.
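
The stratified assignment described above, with teachers randomized separately for each grade within each school, can be illustrated with a minimal sketch. This is not the procedure or software the study used. The function name, the roster fields, the seed, and the rule of alternating conditions down a shuffled list are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def assign_within_strata(teachers, seed=0):
    """Randomly assign teachers to conditions within each (school, grade) stratum.

    `teachers` is a list of (teacher_id, school, grade) tuples. These field
    names are hypothetical and do not come from the study data.
    """
    rng = random.Random(seed)  # fixed seed so the assignment can be reproduced
    strata = defaultdict(list)
    for teacher_id, school, grade in teachers:
        strata[(school, grade)].append(teacher_id)

    assignment = {}
    for key in sorted(strata):
        members = sorted(strata[key])
        rng.shuffle(members)
        # Alternate conditions down the shuffled list. The random offset decides
        # which condition receives the extra teacher in an odd-sized stratum.
        offset = rng.randrange(2)
        for position, teacher_id in enumerate(members):
            in_pathway = (position + offset) % 2 == 0
            assignment[teacher_id] = "Pathway" if in_pathway else "control"
    return assignment


# Hypothetical roster, for illustration only.
roster = [
    ("T01", "School A", 7), ("T02", "School A", 7), ("T03", "School A", 8),
    ("T04", "School B", 9), ("T05", "School B", 9), ("T06", "School B", 9),
]
for teacher_id, condition in sorted(assign_within_strata(roster, seed=2012).items()):
    print(teacher_id, condition)
```

Randomizing within each school-by-grade stratum keeps the two conditions balanced on school and grade by construction, so the group sizes within any stratum differ by at most one teacher.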

Student participants. In the first step of a three-step process,
school counselors employed a software program to randomly
assign students to teachers’ classrooms. Next, recruited teachers
were randomly assigned to either Pathway or control conditions.
Finally, in the third step, because of resource constraints, for both
groups of teachers, one class was selected as the focal class for the
study. For both Pathway and control teachers, the class selected
was the one that had the greatest percentage of ELs, and in which
the students had the English language proficiency necessary to
write in English and thereby profit from the enhanced training
received by the teachers who participated in the program.

Comparison  of  Pathway  and  Control  Classrooms 
at  Baseline 

Table 1 shows descriptive statistics for Pathway and control
group students in the first year of the project, as well as a test of
the significance of between-groups differences. The groups have a
similar distribution across grades, with a slightly smaller percentage
of seventh graders in the Pathway group. The gender distribution
is well matched. Latinos and Asians constitute the great
majority of all students in the schools, and the percentages of these
students in each group are closely matched. However, there is a
small tendency for a higher percentage of Whites in the control
group and a higher percentage of “other race” students in the
Pathway group. ELs are a slightly higher percentage of the control
group (21% vs. 18%) and Initially Fluent English Proficient (IFEP)
students are a slightly higher percentage of the Pathway group (9%
vs. 6%). The two groups have the same percentage (71%) of
students qualified for FRPL. Most important, there are no significant
between-groups differences in pretest scores on the Academic
Writing Assessment (AWA) and the CST for English Language
Arts (CST-ELA). Overall, these statistics show that the
random assignment for the first year of the project was successful
in producing Pathway and control groups that were well balanced
on student characteristics.
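
A between-groups comparison of proportions like those reported in Table 1 can be approximated with a pooled two-proportion z-test. The sketch below is an assumption about the general form of such a test, not the authors' procedure. It treats students as independent observations, ignores the clustering of students within teachers, and uses the rounded proportions from the table, so its results will differ somewhat from the published z and p values. The function and argument names are hypothetical.

```python
import math


def two_proportion_z(p_control, n_control, p_pathway, n_pathway):
    """Pooled two-proportion z-test; returns (z, two-sided p value)."""
    # Pooled proportion under the null hypothesis of no group difference.
    pooled = (p_control * n_control + p_pathway * n_pathway) / (n_control + n_pathway)
    standard_error = math.sqrt(pooled * (1 - pooled) * (1 / n_control + 1 / n_pathway))
    z = (p_pathway - p_control) / standard_error
    # Two-sided p value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# EL share at baseline: 21% of 1,493 control students vs. 18% of 1,705 Pathway students.
z, p = two_proportion_z(0.21, 1493, 0.18, 1705)
print(f"z = {z:.2f}, p = {p:.3f}")
```

A negative z here means the Pathway group has the smaller proportion, which matches the sign convention of the proportion rows in the table.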

The  Pathway  and  control  groups  were  not  as  well  matched  at  the 
beginning  of  the  second  year  of  the  project.  This  is  because  the 

Table 1
Baseline Comparison of Pathway and Control Groups, Year 1

                               Control^a       Pathway^b
Variable                      M or %      SD  M or %      SD        z   p
7th Grade (%)                    .18     .39     .16     .37    -1.84   .07
8th Grade (%)                    .20     .40     .21     .40      .21   .83
9th Grade (%)                    .15     .35     .15     .36      .57   .57
10th Grade (%)                   .20     .40     .20     .40     -.02   .98
11th Grade (%)                   .15     .36     .16     .37      .83   .40
12th Grade (%)                   .12     .32     .12     .33      .35   .73
Male (%)                         .51     .50     .50     .50     -.77   .44
Hispanic (%)                     .68     .47     .68     .47      .37   .71
White (%)                        .13     .33     .10     .30    -2.18   .03*
Asian (%)                        .17     .37     .18     .38      .76   .45
Other (%)                        .05     .21     .06     .24     1.84   .07
ELL (%)                          .21     .41     .18     .39    -2.07   .04*
RFEP (%)                         .44     .50     .46     .50     1.34   .18
IFEP (%)                         .06     .24     .09     .28     2.89   .004**
EO (%)                           .29     .46     .27     .45    -1.33   .18
FRPL (%)                         .71     .45     .71     .45     -.17   .87
AWA Pre (M)                     5.69    1.92    5.72    2.05     -.34   .73
CA Standards Test (ELA; M)       358    55.6     360    56.3     -.51   .68

Note. Other = Blacks and Native Americans; ELL = English Language
Learner; RFEP = Reclassified as Fluent English Proficient; IFEP =
Initially Fluent English Proficient; EO = English Only; FRPL = eligible
for free/reduced price lunch; AWA = Academic Writing Assessment;
CA = California; ELA = English Language Arts.
^a 46 teachers, 1,493 students. ^b 49 teachers, 1,705 students.
design called for both Pathway and control teachers to continue in
their respective roles. The goal was to see whether Pathway
teachers would be able to achieve even larger positive effects
following a year of experience with the program. However, there
was teacher attrition between the years (Pathway teachers declined
from 49 to 41 and controls from 46 to 40), and this altered the
demographics of their students (the Pathway students
were receiving the program for the first time, and the controls had
never experienced it). This change in the demographic
match of Pathway and control classrooms is unfortunate but was
impossible to avoid because schools in the study had both differential
demographics and differential teacher attrition. The results
are shown in Table 2. The Pathway students in the second year of
the project were more concentrated in ninth and 12th grades, and
the control students in eighth and 11th grades. The Pathway group
had slightly more female students than the control group. The
Pathway group also had significantly fewer Latino and more Asian
students than the control group; it also had a somewhat higher
share of IFEP students. Finally, the Pathway group had a significantly
lower share of FRPL students than the controls (66% vs.
72%). They also had significantly higher AWA test scores, 5.48
versus 5.11, a 7.2% difference. These are all variables that will be
controlled in our regression analysis estimates of program effects.
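
The 7.2% figure is the Pathway pretest mean expressed relative to the control mean. The sketch below reproduces that arithmetic and, as an additional illustration that is not reported in the article, expresses the same gap in pooled standard deviation units using the Year 2 values from Table 2. Using the baseline student counts (939 and 887) as the group sizes is an assumption, since the AWA was scored for a subsample; both function names are hypothetical.

```python
import math

def relative_difference(reference, comparison):
    """Percent by which `comparison` exceeds `reference`."""
    return 100 * (comparison - reference) / reference

def standardized_difference(m1, sd1, n1, m2, sd2, n2):
    """(m2 - m1) divided by the pooled standard deviation."""
    pooled_var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2)
    return (m2 - m1) / math.sqrt(pooled_var)

# Year 2 AWA pretest: control M = 5.11 (SD = 1.92), Pathway M = 5.48 (SD = 2.38).
pct = relative_difference(5.11, 5.48)
# Assumption: group sizes taken from the Table 2 baseline student counts.
d = standardized_difference(5.11, 1.92, 939, 5.48, 2.38, 887)
print(f"relative difference = {pct:.1f}%, standardized difference = {d:.2f}")
```

A gap of this kind at baseline is the reason the pretest score enters the regression models as a covariate.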

Year  1  analysis  sample  for  student  achievement  analysis. 
The  analysis  sample  was  smaller  than  the  baseline  sample  because 
there  were  two  Pathway  teachers  who  did  not  administer  the  AWA 
at posttest. However, there were no statistically significant differences
between the students of these two teachers and
the rest of the sample in terms of gender, FRPL eligibility, language
proficiency status, and the number of Latino students. As a result,
the  final  sample  size  for  the  AWA  was  93  teachers  (in  Grades  7  to 

Table 2
Baseline Comparison of Pathway and Control Groups, Year 2

                               Control^a       Pathway^b
Variable                      M or %      SD  M or %      SD        z   p
7th Grade (%)                    .28     .45     .29     .45     -.35   .724
8th Grade (%)                    .21     .41     .15     .36     3.26   .001**
9th Grade (%)                    .10     .30     .17     .38    -4.67   .000***
10th Grade (%)                   .13     .34     .15     .35     -.89   .371
11th Grade (%)                   .21     .41     .10     .31     6.08   .000***
12th Grade (%)                   .07     .26     .14     .34    -4.61   .000***
Male (%)                         .52     .50     .48     .50     1.63   .104
Hispanic (%)                     .72     .45     .66     .47     2.54   .011*
White (%)                        .12     .33     .12     .32      .26   .792
Asian (%)                        .14     .34     .19     .39    -2.90   .004**
Black (%)                        .03     .17     .03     .18     -.62   .536
Native American (%)              .05     .22     .07     .25    -1.30   .193
ELL (%)                          .22     .41     .20     .40      .88   .381
RFEP (%)                         .44     .50     .43     .49      .41   .685
IFEP (%)                         .03     .18     .06     .24    -2.73   .006**
EO (%)                           .32     .46     .31     .46      .03   .975
FRPL (%)                         .72     .45     .66     .47     2.69   .007**
AWA Pre (M)                     5.11    1.92    5.48    2.38    -3.03   .002**

Note. ELL = English Language Learner; RFEP = Reclassified as Fluent
English Proficient; IFEP = Initially Fluent English Proficient; EO =
English Only; FRPL = eligible for free/reduced price lunch; AWA =
Academic Writing Assessment.
^a 40 teachers, 939 students. ^b 41 teachers, 887 students.
12) in 47 Pathway classrooms and 46 control classrooms. To
adhere  to  budget  constraints,  we  scored  pre-  and  posttest  on- 
demand  writing  assessments  for  a  random  sample  of  20  students 
within  each  class.  There  was  no  significant  difference  on  pretest 
AWA  writing  scores  for  the  Pathway  (M  =  5.72,  SD  =  2.05)  and 
control  (M  =  5.69,  SD  =  1.92,  p  =  .73)  classrooms. 

Year  2  analysis  sample  for  student  achievement  analysis  on 
posttest  measures.  The  Year  2  analysis  sample  in  2013-2014 
(n  =  81  teachers)  was  smaller  than  the  final  Year  1  sample  in 
2012-2013  (n  =  93)  because  of  teacher  attrition.  As  a  result,  the 
final  sample  for  the  AWA  in  Year  2  consisted  of  40  control 
classrooms  and  41  Pathway  classrooms.  Although  14%  of  the 
teacher  sample  was  lost  to  attrition  from  Year  1  to  Year  2,  the 
attrition  rate  did  not  differ  significantly  between  the  two  groups 
(p  <  .42).  To  adhere  to  budget  constraints,  we  scored  posttest 
on-demand  essays  for  a  random  sample  of  20  students  within  each 
class. 
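
The within-class sampling step can be sketched as follows. This is a minimal illustration, not the procedure the research team used: the roster structure, the fixed seed, and the rule that classes with fewer than 20 students are scored in full are all assumptions, and `sample_for_scoring` is a hypothetical name.

```python
import random

def sample_for_scoring(rosters, per_class=20, seed=0):
    """Pick up to `per_class` students at random from each class roster."""
    rng = random.Random(seed)  # fixed seed so the draw can be reproduced
    selected = {}
    for class_id, students in rosters.items():
        # Assumption: classes smaller than `per_class` are scored in full.
        k = min(per_class, len(students))
        selected[class_id] = sorted(rng.sample(students, k))
    return selected

# Hypothetical rosters: class and student IDs are made up for illustration.
rosters = {
    "class_A": [f"A{i:02d}" for i in range(31)],
    "class_B": [f"B{i:02d}" for i in range(18)],
}
scored = sample_for_scoring(rosters)
print({class_id: len(ids) for class_id, ids in scored.items()})
```

Drawing the same students for pretest and posttest scoring would require storing the selected IDs rather than re-sampling, a detail the article does not specify.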

Description  of  the  Pathway  Project  Professional 
Development  Program 

Pathway  teachers  in  the  intervention  participated  in  46  hr  of 
training  each  school  year  (via  six  6-hr  released  days  interspersed 
throughout  the  school  year  and  five  2-hr  after-school  sessions) 
focused  on  methods  for  helping  Latinos  and  mainstreamed  ELs  to 
develop  the  academic  literacy  necessary  to  meet  the  CCSS-ELA, 
with special emphasis on interpretive reading and analytical writing.
These included literary response and analysis, comprehension
and analysis of informational nonfiction texts, and development of
clear, coherent, focused essays. Training was led by the developers
of the Pathway Project and supported by literacy coaches who
participated as Pathway teachers in a previous quasi-experimental
research study (Olson & Land, 2007). There are three core components
of the Pathway Project: (a) training in the use of the cognitive
strategies tool kit and curriculum materials, (b) Pathway Project
activities focused on the revision of the pretest on-demand writing
assessment into a multiple-draft essay, and (c) coaching from a
more experienced, veteran teacher previously trained in the program
on how to integrate a cognitive strategies approach into the
existing English language arts curriculum.

Cognitive  strategies  tool  kit  and  curriculum  materials. 
Strategy instruction in this intervention occurred within the context
of teaching reading and writing as a process and involved prereading,
during-reading, and postreading activities, as well as prewriting,
planning, drafting, sharing, revising, and editing activities.
The  Pathway  Project  was  initially  developed  over  an  8-year  period 
in  the  SAUSD  (Olson  &  Land,  2007),  in  response  to  the  specific 
needs  of  teachers  serving  a  rapidly  increasing  population  of  ELs. 
At  the  time,  few  interventions  were  available  that  addressed  the 
writing  challenges  of  ELs.  For  example,  in  a  recent  meta-analysis 
of writing instruction (Graham, McKeown, Kiuhara, & Harris,
2012), only three studies (of 115) for which an effect size could be
computed included ELs. Therefore, the Pathway Project and its
year-long, six-full-day professional development schedule constantly
evolved as new research became available. During the first
two PD days, teachers were introduced to a model of the cognitive
strategies that make up a reader’s and writer’s mental tool kit (see
Figure 2).

Figure 2. Cognitive strategies tool kit.

These thinking tools or acts of mind map directly onto
the CCSS-ELA Anchor Standards for College and Career Readiness
in Reading and Writing that call for students to be able to do
the following as they read and write about complex texts: summarize,
make inferences, analyze, interpret, draw conclusions, evaluate,
assess, revise, and reflect. The CCSS-ELA leaves to curriculum
developers, states, and teachers’ “professional judgment and
experience” the determination of what “tools and knowledge”
students will need in order to meet these challenging new standards
(National Governors Association Center for Best Practices &
Council of Chief State School Officers, 2010, p. 4). The Pathway
Project focused on these tools. Teachers used the following analogy
to introduce the concept of the tool kit:

When  we  read,  we  have  thinking-tools  or  cognitive  strategies  inside 
our  heads  that  we  access  to  construct  meaning.  Researchers  say  that 
when  we  read,  we’re  composing,  just  as  when  we  write.  What  they 
mean  is  that  while  we  read,  we’re  creating  our  own  draft  of  the  story 
in  our  heads  and  as  we  keep  reading  and  come  across  something  we 
didn’t  expect  to  happen  or  suddenly  make  a  big  discovery  about  what 
something  means,  we  start  on  a  second  draft  of  our  understanding.  So, 
when  you  think  of  yourself  as  a  reader  or  writer,  think  of  yourself  as 
a  craftsman,  skilled  in  making  things  with  your  hands,  but  instead  of 
reaching  into  a  metal  tool  kit  for  a  hammer  or  a  screwdriver  to 
construct  or  build  tangible  or  real  objects  you  can  actually  see,  you’re 
reaching  into  your  mental  tool  kit  to  construct  meaning  from  or  with 
words. 

To  reinforce  the  tool  kit  analogy,  teachers  received  wall  posters 
with  visuals  representing  the  cognitive  strategies,  and  students 
received bookmarks as well as 8.5 in. × 11 in. copies of cognitive
strategies sentence starters that illustrate what goes on in the mind
of a reader or writer in the act of meaning construction. For
example, a sentence starter for revising meaning is “At first I
thought ____, but now I . . . ,” and a starter for reflecting and relating
is “So, the big idea is . . .” To build students’ declarative knowledge
of what cognitive strategies are, teachers presented scaffolded
lessons  called  “tutorials”  (Bruner,  1978),  in  which  they  introduced 
each  of  the  tools  in  the  tool  kit  to  students  within  the  context  of 
reading  and  writing  about  high-interest  literary  or  nonfiction  texts. 
To  enhance  their  procedural  knowledge  of  how  to  implement  the 
strategies,  students  received  instruction  on  how  to  make  marginal 
annotations  to  interpret  complex  texts,  and  kept  reading  logs  with 
key  quotes  from  the  texts  they  were  reading  and  commentary  on 
those  quotes.  Finally,  to  foster  conditional  knowledge  of  when  to 
use  a  cognitive  strategy,  which  strategy  to  use,  and  why,  students 
were  taught  to  think  aloud  in  response  to  complex  texts  while  a 
partner  recorded  their  responses  and  then  labeled  their  strategy  use, 
as  well  as  wrote  metacognitive  reflections  describing  the  cognitive 
strategies  they  used  in  order  to  form  interpretations  about  texts  and 
write  analytical  essays.  The  Pathway  Project  provided  a  wide  array 
of teacher-tested and easy-to-use paper and computer-based materials
as models of curriculum and instruction. Because these
materials were designed to be presented to students across the
grade levels (7 to 12) and with varying degrees of language
proficiency, teachers were given time to meet in grade-level
groups and as school teams to discuss how to modify the materials
to meet their specific students’ needs. Additionally, optional scaffolded
materials were provided for struggling students. For example,
a unit entitled “The Symbolism Detour” was designed for ELs
who had difficulty understanding figurative language. Teachers
were encouraged to gradually withdraw the instructional scaffolding
embedded in the materials as students demonstrated their
ability  to  implement  cognitive  strategies  in  reading  and  writing 
more  independently  (Pearson  &  Gallagher,  1983). 
Formative assessment and revision of pretest. Second,
teachers learned how to use results from the on-demand analytical
writing pretest (the Analytical Writing Assessment [AWA]) to
provide  instruction  in  text-based  analytical  writing.  To  that  end, 
the  professional  development  focused  on  preparing  students  to 
read,  make  inferences,  and  form  interpretations  about  complex 
literary  and  nonfiction  texts  and  to  convey  interpretations  in 
thoughtful, well-organized essays that present a clear thesis, supported
with appropriate textual evidence. The centerpiece of the
Pathway Project is an extensive set of materials shared during
Days 3 and 4 of the professional development program, focused on
the revision of the students’ pretest writing assessment (a text-based
analytical essay) into a multiple-draft essay. Student performance
on this timed, on-demand pretest essay was used to inform
the  Pathway  Project  as  teachers  engaged  in  analyzing  students’ 
work  and  identifying  students’  strengths  and  areas  for  growth 
(Biancarosa  &  Snow,  2004;  Black  &  Wiliam,  1998). 

Two  of  the  10  Common  Core  College  and  Career  Readiness 
Standards  for  Reading  for  Grades  6-12  focus  on  determining  and 
analyzing  themes  in  both  single  and  multiple  texts.  Additionally,  in 
a national survey of 2,351 high school and college teachers administered
by The College Board to identify skills teachers thought
were most important for students entering college, respondents
rated identifying the theme of a text and making inferences and
drawing conclusions as the most important reading skills, and
writing a clear, coherent essay using supporting details as the most
important writing skill students should possess (Milewski, Johnson,
Glazer, & Kubota, 2005). Despite the importance of theme, many students
have an inadequate grasp of what a theme is. For this reason,
the prompts for the Year 1 AWA (2012-2013) focused on analyzing
theme in two works of fiction, the short stories “The Medicine
Bag”  by  Virginia  Driving  Hawk  Sneve  (1991)  and  “The  Scarlet 
Ibis”  by  James  Hurst  (1960),  one  at  pretest  and  one  at  posttest. 
Order  effects  were  controlled  by  counterbalancing  the  two  timed 
writing assessments across classrooms each year. In Year 2, teachers
received additional training in using a cognitive strategies
approach to teaching informational texts and literary nonfiction,
genres prioritized in the CCSS-ELA, which students may have had
less experience studying and writing about in their English language
arts classes. The Year 2 AWA prompts required students
to analyze two newspaper articles, “Sometimes, the Earth is
Cruel”  by  Leonard  Pitts  (2010)  and  “The  Man  in  the  Water”  by 
Roger  Rosenblatt  (1982),  develop  a  theme  statement  about  the 
author’s  message,  and  evaluate  the  author’s  purpose  for  writing 
each  article. 
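
Counterbalancing of the two prompts can be illustrated with a short sketch in which half of the classrooms write to one text at pretest and the other at posttest, and the remaining half receive the reverse order. The article does not describe how classrooms were allocated to the two orders, so the random shuffle, the seed, and the classroom labels below are assumptions made for illustration only.

```python
import random

def counterbalance(classrooms, prompts=("The Medicine Bag", "The Scarlet Ibis"), seed=0):
    """Give half the classrooms each prompt order as (pretest, posttest)."""
    rng = random.Random(seed)
    shuffled = list(classrooms)
    rng.shuffle(shuffled)  # assumption: allocation to orders is random
    first, second = prompts
    plan = {}
    for i, room in enumerate(shuffled):
        # Alternate orders so the two sequences are used equally often.
        plan[room] = (first, second) if i % 2 == 0 else (second, first)
    return plan

# Hypothetical classroom labels.
plan = counterbalance([f"room_{i}" for i in range(1, 9)])
for room, (pre, post) in sorted(plan.items()):
    print(room, "| pretest:", pre, "| posttest:", post)
```

With this design, any difference in difficulty between the two prompts is spread evenly across pretest and posttest rather than being confounded with growth.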

Pretest  essays  were  used  for  both  formative  and  summative 
purposes.  Based  on  the  teachers’  analysis  of  students’  pretest 
essays, lessons were implemented to address students’ needs relative
to the CCSS-ELA. For example, Paper 1C124 in Appendix
A, an exploration of theme in “The Medicine Bag” written by an
eighth-grade Latina reclassified EL student in October, contains a
number of areas for improvement that are typical of the pretests we
received. First, the paper is written as one long paragraph without
a formal introduction, main body, and conclusion. Graham and
Perin (2007) note that “teaching adolescents strategies for planning,
revising, and editing their compositions has shown a dramatic
effect on the quality of students’ writing” (p. 16). During the
revision activities, students were taught a planning strategy, called
“HoT S-C Team” (see Figure 3), to lead into their essays with a
hook (a quote, question, description, or statement to make people
think), followed by a “TAG” (title, author, and genre), a statement
summarizing the conflict, and a thesis containing their theme
statement.

Although  this  strategy  might  be  perceived  as  too  formulaic  for 
advanced  writers,  novice  essay  writers  need  to  be  exposed  to 
form-making before engaging in form-breaking. Students were
also introduced to transition words to develop coherence and to use
as a bridge to their main body and conclusion, and they practiced
inserting appropriate transition words into cloze exercises to become
familiar with the use and position of words like in addition,
however, nevertheless, whereas, on the other hand, and so forth.

Another  very  common  shortcoming  in  pretest  essays  was  the 
overreliance  on  summarizing.  As  has  been  widely  reported,  ELs 
who  have  been  in  English  Language  Development  programs  often 
receive  instruction  that  focuses  primarily  on  literal  comprehension. 
Consequently,  they  tend  to  rely  on  retelling  when  writing  a  text- 
based  analytical  essay  as  a  way  to  prove  that  they  understood  what 
they  read  rather  than  offering  interpretation  and  commentary  to 
support  their  argument.  As  Scardamalia  and  Bereiter  (1987)  point 
out,  novice,  inexperienced,  and  struggling  writers  use  a  simplified 
version  of  the  idea-generation  process  they  call  knowledge-telling, 
which  consists  of  retrieving  information  from  long-term  memory 
and  converting  the  writing  task  into  simply  regurgitating  what  is 
known  about  a  topic.  More  experienced  writers,  on  the  other  hand, 
engage  in  a  complex  composing  process  known  as  knowledge 
transformation,  in  which  they  analyze  the  writing  task  and  plan 
what  to  say  and  how  to  say  it  in  accordance  with  rhetorical, 
communicative,  and  pragmatic  constraints.  Noting  that  students 
whose text production system corresponds to the knowledge-telling
model “need more than encouragement to revise” (p. 156),
Scardamalia and Bereiter suggest that they need modeling, lessons
in planning and writing for an audience, and “insight into their own
composing processes” (p. 165). One way to help students move
from  knowledge-telling  to  knowledge  transformation  is  to  help 
them  make  their  thinking  visible  after  they  have  composed  a  first 
draft  of  an  essay  using  a  color-coding  process. 

Teachers  first  designated  three  colors  for  the  types  of  assertions 
that  comprise  a  text-based  analytical  essay  and  said  the  following: 

Plot  summary  reiterates  what  is  obvious  and  known  in  a  text.  Reiterate 
means  to  repeat  in  order  to  make  something  very  clear.  Plot  summary 
is  yellow  because  it’s  like  the  sun.  It  makes  things  as  plain  as  day.  We 
need  some  plot  summary  to  orient  our  reader  to  the  facts,  but  we  do 
not  need  to  retell  the  entire  story.  Commentary  is  blue  like  the  ocean 
because  the  writer  goes  beneath  the  surface  of  things  to  look  at  the 
deeper  meaning  to  offer  opinions,  interpretations,  insights,  and  “Ah- 
Ha’s.”  Supporting  detail  is  green  because  like  the  color,  it  brings 
together  the  facts  of  the  text  (yellow)  with  your  interpretation  of  it 
(blue).  It  is  what  glues  together  plot  summary  and  commentary.  It’s 
your  evidence  to  support  your  claims,  including  quotations  from  the 
text. 

The  next  step  was  to  model  the  process  of  color  coding.  Because 
half  of  the  students  in  the  study  wrote  to  one  prompt  and  half  wrote 
to  the  other  at  pretest,  we  selected  a  third  text  as  a  training  text.  In 
Year  1,  “The  Horned  Toad”  by  Gerald  Haslam  (1995)  was  used  as 
the training text. The sample paragraph about “The Horned Toad”
illustrates  how  one  codes  for  summary,  supporting  detail,  and 
commentary: 
How Do I Begin?
The Introduction to Your Interpretive Essays

4 Parts: HoT S-C T
(HoT S-C Team) = (Hook/TAG/Story-Conflict/Thesis)

Hook: Begin your introductory paragraph with an attention grabber or "hook" to
capture the reader's interest. It might include one of the following:

•  Opening with an exciting moment from the story
•  An interesting description
•  Dialogue
•  Quotation from the text
•  A statement to make people think
•  An anecdote (a brief story)
•  A thought-provoking question (a question that makes people think)

TAG: Follow the "hook" with a TAG (title/author/genre = type of literature such as short
story, narrative, novel, play, poem) that identifies all three parts of TAG for the reader.

Summary Statement-Conflict: As a part of the TAG, or right after the TAG, include a brief
summary of the story and its conflict. Usually two or three sentences are enough to give
background information to the reader about the story and the conflict.

Thesis Statement: The thesis statement in an essay is the claim the writer makes in
response to the prompt. The thesis statement is the "key" that will "drive" your essay.

Do people go on a trip with no idea of where to go? No, they look at a map or check the
Internet for driving directions. Your job as a writer is to "map" your essay for the readers.
Tell the reader where you will take them.

Figure 3. HoT S-C Team. Source: Olson, Scarcella, and Matuchniak (2013). Reprinted with permission.


When the narrator’s horned toad was crushed on the pavement,
Grandma consoles him over the loss of her pet. (Yellow)

She joins him in grieving and strokes his back and then she picks up
the horned toad and mutters, “The poor little beast.” (Green)

Like the horned toad, Grandma is also out of place. (Blue)

The horned toad is symbolic of Grandma because they are kindred
spirits who need to return to where they belong. (Blue)

After students were introduced to the color-coding system, they
practiced coding sample essays on theme in “The Horned Toad”
that  were  marginal/not  pass  (1  to  3  on  a  6-point  scale)  and 
adequate  to  strong  pass  (4  to  6  on  a  6-point  scale).  Starting  with  the 
weaker  paper,  students  noticed  that  most  of  the  sentences  fell  into 
the  yellow  category,  whereas  the  stronger  paper  had  a  balance  of 
yellow,  green,  and  blue.  Students  then  applied  the  color-coding 
strategy to their own first drafts to see whether they had
simply  summarized  or  whether  they  had  provided  ample  textual 
evidence  and  commentary.  The  coded  draft  then  became  a  visible 
guide  for  revision. 
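
The contrast students noticed between weaker and stronger papers amounts to a difference in the proportion of sentences in each color. The sketch below tallies those proportions for the sample paragraph about “The Horned Toad.” It is only an illustration of the idea: in the classroom the coding was done by hand with colored markers, and the data structure and the function name `color_profile` are hypothetical.

```python
from collections import Counter

# Color scheme described by the teachers.
LABELS = {"yellow": "plot summary", "green": "supporting detail", "blue": "commentary"}

def color_profile(coded_sentences):
    """Return the share of sentences carrying each color label."""
    counts = Counter(color for _, color in coded_sentences)
    total = sum(counts.values())
    if total == 0:
        return {color: 0.0 for color in LABELS}
    return {color: counts[color] / total for color in LABELS}

# The four hand-coded sentences from the sample paragraph.
sample = [
    ("When the narrator's horned toad was crushed on the pavement, "
     "Grandma consoles him over the loss of her pet.", "yellow"),
    ("She joins him in grieving and strokes his back and then she picks up "
     "the horned toad and mutters, \"The poor little beast.\"", "green"),
    ("Like the horned toad, Grandma is also out of place.", "blue"),
    ("The horned toad is symbolic of Grandma because they are kindred "
     "spirits who need to return to where they belong.", "blue"),
]

profile = color_profile(sample)
for color, share in profile.items():
    print(f"{color:>6} ({LABELS[color]}): {share:.0%}")
```

A draft whose profile is dominated by yellow is one that mostly retells the story, which is the pattern the revision lessons were designed to change.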

At  pretest,  a  majority  of  the  students  in  the  study  had  difficulty 
distinguishing  between  a  topic  and  a  theme  and  tended  to  use  topic 
words like death, loss, or selfishness to identify the theme in “The
Scarlet Ibis,” or family, relationships, or tradition, as the student in
Paper 1C124 does, to describe theme in “The Medicine Bag,” rather
than to present a well-developed theme statement, or they failed to
discuss  the  theme  altogether.  To  help  students  to  distinguish  between 
a  topic  and  a  theme,  teachers  first  provided  the  following  explanation: 

A  story’s  theme  is  different  from  its  topic  or  subject.  The  topic  is 
simply  what  it’s  about.  The  theme  is  the  author’s  point  about  a  topic. 
Think  of  a  topic  as  the  What  of  the  story  and  the  theme  as  the  So 
What?  To  identify  a  theme,  sometimes  it  helps  to  brainstorm  a  list  of 
topics  or  big  ideas  in  a  story.  Common  topics  for  themes  that  you’ll 
find in stories are usually abstract nouns that deal with human relationships
and include terms like belonging, courage, family, friendship,
hope, identity, prejudice, respect, revenge, and trust. A theme
statement must be a complete sentence that states the author’s message
about life or about human relationships, such as “Loss can bring
people together.” A good theme statement applies to people in general,
not just to the specific characters in the story.

To  help  students  practice  developing  theme  statements,  teachers 
showed  several  short  video  clips  from  familiar  movies  featuring 
teenage  protagonists  in  situations  in  which  a  theme  could  be 
derived.  For  instance,  in  the  Disney/Pixar  animated  film  Brave, 
Princess  Merida  defies  an  age-old  custom  to  be  betrothed  to  her 
future  husband  through  an  arranged  marriage  orchestrated  by  her 
parents.  In  the  clip,  during  an  archery  contest  in  which  various 
suitors  compete  to  win  her  hand  in  marriage,  Merida  grabs  the  bow 
and  shoots  with  better  precision  than  any  of  the  would-be  grooms. 
Students generated the following topics and theme statements in
response  to  the  clip. 

Topics:  Tradition,  heritage,  culture,  chivalry,  defiance,  marriage, 
authority,  disobedience 

Theme  statements: 

•  Sometimes it’s necessary to challenge the traditions of one’s heritage.

•  Control  of  your  life  is  in  your  own  hands. 

•  Women  should  not  be  considered  second-class  citizens  and  should 
have  the  freedom  to  choose  the  person  they  want  to  marry. 

Thinking  of  the  topic  as  the  “What”  and  the  theme  as  the  “So 
What?”  and  practicing  with  the  video  clips  seemed  to  turn  on  a 
light  bulb  for  students  and  helped  them  to  create  a  more  definitive 
theme  statement  to  support  with  textual  evidence. 

Another area in which students needed support during the revision
process was the use of academic language. Academic language
is a key component of effective instruction for ELs (Gersten,
Baker,  Shanahan,  Linan-Thompson,  &  Collins,  2007;  Goldenberg, 
2013;  Rivera  et  al.,  2010;  Short  &  Fitzsimmons,  2007)  and  the 
explicit  instruction  of  academic  English  for  all  students  figures 
prominently  in  the  CCSS-ELA.  In  analyzing  students’  pretests,  we 
found  that,  by  and  large,  they  used  an  informal  and  conversational 
register  of  English  rather  than  the  more  formal,  academic  register 
called  for  in  an  analytical  essay.  In  other  words,  they  used  a  “think 
it/say  it”  type  of  “unretouched”  and  “underprocessed”  expression 
of  thought  written  by  a  writer  to  him/herself  for  him/herself 
diagnosed  by  Flower  (1979)  as  writer-based  prose  rather  than  a 
more  deliberate  and  mature  structure  and  style  adapted  to  a  reader, 
or  reader-based  prose  (p.  19).  These  more  informal  linguistic 
features include general words like nice, story, or man instead of
more academic or technical ones like compassionate, narrative, or
novelist; slang like stuff, guy, and blown away; inappropriate
hedges like kind of and sort of (as the author of Paper 1C124 does
when she says that Martin was “kinda embarresed” [sic] of his
grandfather); informal expressions and markers of spoken English
like ya know and by the way; needless repetition; misspellings;
grammatical  errors;  contractions;  poorly  linked  sentences  with  an 
absence  of  clear  referents;  the  unnecessary  use  of  I  (e.g.,  I  think,  I 
believe,  and  In  my  opinion,  when  it  is  clear  that  the  student  is 
offering  an  opinion),  and  simple  rather  than  complex  sentences 
with  little  sentence  variety. 
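
Several of the informal features listed above are lexical and can be located mechanically. The sketch below flags a handful of them in a passage. It is a hypothetical aid for illustration, not part of the Pathway materials: the category names and the function `flag_informal` are assumptions, the phrase lists contain only the examples named in this section plus the spellings “kinda” and “sorta,” and features such as needless repetition, grammatical errors, or poorly linked sentences are beyond what simple matching can detect.

```python
import re

# Example markers drawn from the features described in this section.
INFORMAL_MARKERS = {
    "slang": ["stuff", "guy", "blown away"],
    "hedge": ["kind of", "sort of", "kinda", "sorta"],
    "spoken": ["ya know", "by the way"],
    "unnecessary I": ["i think", "i believe", "in my opinion"],
}

def flag_informal(text):
    """Return sorted (position, category, phrase) hits for informal markers."""
    lowered = text.lower()
    hits = []
    for category, phrases in INFORMAL_MARKERS.items():
        for phrase in phrases:
            # Word boundaries keep "guy" from matching inside longer words.
            pattern = r"\b" + re.escape(phrase) + r"\b"
            for match in re.finditer(pattern, lowered):
                hits.append((match.start(), category, phrase))
    return sorted(hits)

for position, category, phrase in flag_informal("Grandma was sorta mean I think."):
    print(position, category, phrase)
```

A list of this kind could support the pair work on contrasting registers, but judging whether a hedge or a first-person phrase is actually inappropriate still requires a reader.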

To  help  students  become  more  conscious  of,  and  adept  at, 
writing  in  a  register  of  English  suited  to  a  formal  audience,  we 
seamed  together  a  “typical”  pretest  essay  written  in  informal 
English  and  then  rewrote  the  same  paper  using  more  formal 
academic  language.  Students  were  instructed  to  work  in  pairs  to 
create  a  chart  noting  the  features  of  each  essay  that  made  it 
informal  or  formal.  Next,  the  partners  were  given  the  following 
informal  passage  and  instructed  to  revise  all  of  the  italicized  words 
(which  also  included  some  common  editing  errors)  to  make  them 
sound  more  academic: 

An  Informal  Passage  to  Improve 

The  boy  in  the  story  really  didn’t  like  his  Great-Grandma  when 
they  first  met.  She  barked  orders  at  him  and  everyone  else.  She  said 
scary  things  about  his  horned  toad,  and  his  mother  made  him  take 
it back were he found it. Grandma was sorta mean I think. Like she
called him el malcriado, the spoiled boy. It seems that the boy
kinda started to like his Great-Grandma when she gave him money
for candy and for sure when she helped him bury the horned toad.
Maybe  he  would  of  liked  her  in  the  beginning  if  she  had  talked 
English  to  him.  Cause  she  didn’t  try  in  the  beginning,  he  didn’t 
either.  A  theme  for  this  story  might  be  something  like  “Never  judge 
a  book  by  its  cover.”  This  means  wait  til  you  get  to  know  someone 
before  you  make  a  judgment  about  whether  you  like  or  dislike 
them. 

Appendix  B  includes  the  posttest  on-demand  essay  written  by 
Student 1C124 in May, after the revision training. Note that her
paper  opens  with  a  hook,  followed  by  a  TAG  (title,  author, 
genre)  and  a  theme  statement;  has  a  clear  introduction,  main 
body,  and  conclusion;  includes  a  blend  of  summary,  evidence, 
and  commentary;  uses  transition  words  and  academic  words 
such  as  narrator  and  quotes,  as  well  as  a  sentence  variety 
strategy  we  taught  called  adjectives  out  of  order  (Noden,  2011). 
Although  the  paper  still  contains  some  errors  in  correctness,  the 
growth  in  the  student’s  academic  writing  is  substantial. 

Coaching. The third core component of the Pathway Project
involves coaching. Throughout Year 1, teachers received ongoing
support from a retired veteran teacher who had previous
exposure to the project during an earlier 8-year quasi-experimental
study in the SAUSD (Olson & Land, 2007). These
coaches conducted three informal, nonevaluative classroom
observations and provided teachers with detailed written feedback
identifying areas of strength, areas for improvement, and specific
suggestions of classroom practices teachers could implement.
During Year 2, a lead English language arts teacher from
each school who had been groomed during Year 1 of the
training assumed the coaching role. Pathway Project coaches
attended professional development trainings along with the
school team to whom they were assigned and assisted teachers
in integrating interpretive reading and analytical writing
instruction using the cognitive strategies approach into the
lessons in their textbook. Research indicates that when coaching is
combined with professional development, teachers are more
likely to implement innovations in their classroom (Buly,
Coskie, Robinson, & Egawa, 2006; Joyce & Showers, 2002;
Olson & Land, 2008).

Description  of  business-as-usual  professional  development 
activities.  In  contrast,  control  teachers  conducted  business  as 
usual,  using  the  district  English  language  arts  textbook  and  core 
novels  for  teaching.  Both  groups  attended  one  full  day  of 
professional  development  led  by  district  curriculum  specialists 
on  protocols  for  reviewing  district  benchmark  assessments.  In 
Year  2,  in  addition  to  district  benchmarks,  text  complexity  was 
also  addressed  by  a  district  curriculum  specialist. 

Data  Sources  and  Collection 

Data  collection  took  place  from  September  2012  until  June  of 
2014.  Data  sources  included  existing  school  district  records  of 
student  demographic  information  (gender,  race/ethnicity,  FRPL, 
and  EL  status),  and  outcome  measures  including  pre-  and  posttest 
on-demand  AWAs  administered  by  the  participant  teachers  and  the 
California  High  School  Exit  Exam  (CAHSEE). 
Measures and Procedures

AWA. The AWA is a measure developed for the Pathway
Project to assess the analytical writing skills of secondary school
students. In October of each year, students in the classes of
Pathway and control teachers wrote a timed, on-demand, text-based
academic essay interpreting a theme in one of two literary
selections (“The Scarlet Ibis” and “The Medicine Bag” in Year 1)
or one of two nonfiction newspaper articles (“Sometimes, the
Earth is Cruel” and “The Man in the Water” in Year 2) to
determine their growth as academic writers over time. As
mentioned previously, order effects were controlled by
counterbalancing the two timed-writing assessments across
classrooms in each year. Essays were organized by classrooms and
then randomly assigned to trained raters. Each rater scored essays
holistically on a 6-point scale to assess the quality and depth of
the interpretation, the clarity of the thesis, the organization of
ideas, the appropriateness and adequacy of the evidence, sentence
variety, and the correct use of English-language conventions. Our
rubric for scoring the AWA was based on those used to evaluate the
essay portion of the CAHSEE (California Department of Education,
2008a), the California STAR 7 Direct Writing Assessment
(California Department of Education, 2008b), and the NAEP (ACT,
Inc., 2007). Each paper received two ratings, yielding a combined
score ranging from 2 (1 + 1) to 12 (6 + 6), with all papers in
which raters disagreed by more than 1 point scored by a third rater.
In total, there were 60 raters. Each essay was independently scored
by two raters, with 52% exact agreement and 94% within 1-point
agreement in Year 1, and 47% exact agreement and 93% within
1-point agreement in Year 2. Discrepancies of more than 1 point
were resolved by taking the average score of the first two raters
and then summing this number with the third rater’s score. For
example, if the first two raters assigned scores of 2 and 4, and the
third rater assigned a score of 3, the final score was 6.
Approximately 6% of the papers were scored by a third rater.
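
The scoring rule described above can be sketched in a few lines of Python. This is an illustrative reconstruction based only on the description in the text, not the authors' scoring procedure as implemented; the function name `final_awa_score` and its argument names are hypothetical.

```python
from typing import Optional


def final_awa_score(rater1: int, rater2: int, rater3: Optional[int] = None) -> float:
    """Combine holistic ratings (1-6 each) into a final AWA score (2-12).

    Hypothetical helper: sums the two ratings when they agree within
    1 point; otherwise averages them and adds the third rater's score.
    """
    for rating in (rater1, rater2):
        if not 1 <= rating <= 6:
            raise ValueError("each rating must be between 1 and 6")

    # Ratings within 1 point of each other are simply summed.
    if abs(rater1 - rater2) <= 1:
        return float(rater1 + rater2)

    # A discrepancy of more than 1 point requires a third rating.
    if rater3 is None:
        raise ValueError("ratings differ by more than 1 point; third rating required")
    if not 1 <= rater3 <= 6:
        raise ValueError("each rating must be between 1 and 6")

    # Average of the first two ratings plus the third rater's score.
    return (rater1 + rater2) / 2 + rater3


# Worked example from the text: ratings of 2 and 4, third rating of 3.
print(final_awa_score(2, 4, 3))
```

For the example in the text, the average of 2 and 4 is 3, and adding the third rating of 3 gives the final score of 6.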

CAHSEE. In anticipation of the Smarter Balanced Assessment,
California elected not to administer the CST in 2014. However,
the CAHSEE was administered to all 10th graders during
2013 and 2014. Tenth graders first sit for this exam during March
of each year. The CAHSEE is a standardized literacy assessment
that is aligned with the California English/Language Arts academic
content standards through Grade 10 and is a graduation
requirement for all high school students. The test consists of
multiple choice questions and a timed writing task. The reading
portion includes vocabulary, reading comprehension, and analysis
of information and literary texts. The writing portion covers
writing strategies, applications, and conventions. The writing task is
randomly assigned at each test sitting and calls for students to
write a response to literature, a biography, an expository piece, or
a persuasive essay. CAHSEE scale scores range from 275 to 450.
A scale score of 350 or higher is needed to pass the CAHSEE.

Fidelity of implementation measure. To assess the fidelity
of implementation, we used the Pathway Project Quality Checklist,
an instrument specifically aligned with our intervention. This
measure relies on the research base on effective literacy
instruction, including principles that appear to enhance student
learning across various models and curricula. In their study of
primary grade reading instruction, Foorman et al. (2006) noted wide
variance in the quality with which teachers implement curricula,
but also found that scores on a Likert rating form were linked to
literacy outcomes. The teacher-focused measure includes a 6-point
Likert scale (1 = little or weak evidence to 6 = impressive
evidence) on four items that the observer uses to rate the degree to
which the teacher (a) demonstrates knowledge of subject matter,
(b) delivers a lesson that is appropriate for the needs of students,
(c) engages students, and (d) uses language arts strategies that are
consistent with the Pathway Project. The student-focused measure
includes the same 6-point Likert scale on two items that the
observer uses to rate the degree to which the students (a) exhibit
command of reading and writing strategies, and (b) are on task and
engaged in the lesson.

The  fidelity  of  implementation  measure  was  part  of  a  more 
extensive  form  that  we  used  to  guide  teacher  observations  during 
both  years  of  the  intervention.  During  Year  1,  we  observed  91  of 
the  participating  teachers  (Pathway  and  control)  for  one  class 
period  (46  min)  at  two  time  points  (Winter  2012  and  Spring  2013). 
During  Year  2,  74  of  the  participating  teachers  (all  of  whom  were 
also observed in Year 1) were observed following the same
procedure as in Year 1. In both years, we used the same three trained
raters, who were blinded to teachers’ status as Pathway or control. We
established  interobserver  reliability  on  our  instrument  by  having 
pairs  of  raters  rate  20%  of  the  classrooms  on  all  six  Likert  ratings 
each  year.  We  found  that  across  the  six  Likert  ratings  we  had 
perfect  agreement  in  86%  of  the  sets  of  observations.  Only  14%  of 
the  Likert  ratings  had  any  disagreement,  and  they  were  in,  at  most, 
one  of  the  six  observed  Likert  ratings  in  an  observation  session. 
We take this as strong evidence that we have a reliable,
low-inference observation tool. Any discrepancies were resolved
through  discussion  between  observers. 

To determine whether there was observable growth in the use of
Pathway Project strategies among teachers who underwent the
training, we compared the observation data of Pathway and control
teachers who were observed at both time points (Winter and
Spring) during both years of the program (Year 1 and Year 2). In
Year 1 (2012-2013), there was not a significant difference
between control teachers (M = 3.94, SD = 1.21) and Pathway
teachers (M = 4.39, SD = 1.02) in their use of language arts
strategies and activities consistent with the intervention at the first
observation, conducted in Winter 2012 (p = .10; see Table 3). On
the final observation, there was again not a significant difference
(p = .18) in the time spent by Pathway teachers and control
teachers in their implementation of Pathway Project-specific
strategies and activities. At the first observation in Year 2, there was
not a significant difference (p = .29) between control teachers and
Pathway teachers. At the final observation, however, Pathway
teachers (M = 5.19, SD = 0.67) implemented Pathway Project-specific
strategies and activities at a significantly higher rate than
the control teachers (M = 4.29, SD = 1.30, p < .01). A similar
pattern occurred when observers scored teachers on the extent to
which their students demonstrated command of and effective use of
strategies. During the Year 1 Winter and Spring observations, and
during the Year 2 Winter observation, there were no significant
differences between Pathway and control teachers’ students’
scores (see Table 3). By the Year 2 Spring observation, there was
a significant difference between the two groups (p < .05). The
growth that occurred over time for the Pathway teachers and their
students on both measures suggests that teachers needed prolonged
exposure to and practice with the strategies in order to fully
implement them.

Table 3

Comparison of Pathway and Control Groups on Observation Measures

                                        Control^a        Pathway^b
Observation measures                    M       SD       M       SD       t        p
Teacher’s use of Pathway strategies
  Year 1 Winter                         3.94    1.21     4.39    1.02     -1.67    .10
  Year 1 Spring                         4.35    1.38     4.75    .97      -1.37    .17
  Year 2 Winter                         4.32    1.33     4.64    1.10     -1.07    .29
  Year 2 Spring                         4.29    1.30     5.19    .67      -3.66    .00**
Students’ use of Pathway strategies
  Year 1 Winter                         2.93    1.80     3.34    1.72     -.89     .38
  Year 1 Spring                         3.26    2.07     3.55    1.94     -.55     .59
  Year 2 Winter                         3.96    1.26     4.24    .99      -.93     .36
  Year 2 Spring                         3.96    1.26     4.59    .82      -2.21    .03*

Note. Only teachers who were observed at all four time points were
included in these analyses.
^a 27 teachers. ^b 29 teachers.
* p < .05. ** p < .01. *** p < .001.

Data  Analytic  Strategy 

Separate calculations were undertaken to estimate program
effects for the first and second implementation years in the AUHSD.
To measure the impact of the Pathway Project on student writing
achievement as measured by the AWA, we undertook two sets of
calculations. The first compared the pretest to posttest gains of
Pathway and control groups for the total analysis sample and for
demographic subgroups. The second used the following gain score
regression analysis model:

$$Y_2 - Y_1 = \alpha + \beta_1\,\text{Pathway} + \beta_2\,\text{Controls} + \varepsilon. \tag{1}$$

In this model, $Y_2$ represents the student’s posttest essay score, $Y_1$
represents the student’s pretest essay score, Pathway is a dummy
variable for placement in the Pathway or control group, and
Controls is a set of covariates: gender, race, grade, language
proficiency status, and FRPL eligibility. The models also included
tests for interaction between the Pathway variable and both the
ethnic/race and language subgroups. Unstandardized coefficients
were used for each of the numeric variables, and the program effect
size (Cohen’s d) was calculated by dividing the Pathway coefficient
by the pooled standard deviation of the pretest. The model was
estimated in STATA 13 using the cluster command with the commands
regress and logistic (Rogers, 1993; Williams, 2000). This methodology
adjusts the estimated standard errors using the Huber-White
“sandwich” estimator (Huber, 1967; White, 1980) to correct the standard
errors for clustering of students into the classes of Pathway and
control teachers. It has been shown that when there are more than
20 clusters (as is the case in this study), this technique produces
similar standard errors to those estimated from a multilevel model
(Arceneaux & Nickerson, 2009).
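
The effect size calculation described above can be illustrated with a short Python sketch. It assumes that "pooled standard deviation of the pretest" means the usual degrees-of-freedom-weighted pooling of the two groups' pretest standard deviations; the text does not spell out the pooling formula, so that choice is an assumption. The function names are hypothetical. The inputs are the all-students pretest standard deviations and sample sizes from Tables 4 and 5 and the covariate-adjusted Pathway coefficients from Table 6.

```python
import math


def pooled_sd(sd_a: float, n_a: int, sd_b: float, n_b: int) -> float:
    """Pool two group standard deviations, weighting by degrees of freedom."""
    numerator = (n_a - 1) * sd_a ** 2 + (n_b - 1) * sd_b ** 2
    return math.sqrt(numerator / (n_a + n_b - 2))


def effect_size(coefficient: float, sd_a: float, n_a: int, sd_b: float, n_b: int) -> float:
    """Cohen's d: unstandardized coefficient divided by the pooled pretest SD."""
    return coefficient / pooled_sd(sd_a, n_a, sd_b, n_b)


# Year 1: control pretest SD 1.92 (n = 851), Pathway pretest SD 2.05 (n = 966),
# adjusted Pathway coefficient 0.96.
d_year1 = effect_size(0.96, 1.92, 851, 2.05, 966)

# Year 2: control pretest SD 1.92 (n = 616), Pathway pretest SD 2.38 (n = 634),
# adjusted Pathway coefficient 1.30.
d_year2 = effect_size(1.30, 1.92, 616, 2.38, 634)

print(round(d_year1, 2), round(d_year2, 2))
```

Under these assumptions the sketch gives values of about 0.48 and 0.60, in line with the effect sizes reported for the two implementation years.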

The  CAHSEE  provided  another  opportunity  to  measure  the 
impact  of  the  Pathway  Project  on  teachers  and  students.  The  results 
for the students were reported by the district. Students were coded
as passing (1), failing (0), or not yet tested (2). For the analyses in
each year, the 10th-grade students’ performances on the CAHSEE
were analyzed. We did not analyze the 11th- or 12th-grade students
because the spring of 10th grade is when students first attempt
the CAHSEE. The numbers of students retaking the test in
11th and 12th grade are much smaller and are subject to selection
bias.  As  with  the  AWA  analyses,  we  first  compared  the  outcomes 
for  Pathway  and  control  students  overall  and  within  each  of  the 
demographic  subgroups.  Following  this,  we  ran  logistic  regression 
analyses  predicting  passing  the  CAHSEE  using  models  which 
included  the  group  of  control  variables  and  tested  for  interactions 
between  the  Pathway  coefficient  and  both  the  race/ethnicity  and 
language  groups.  As  with  the  AWA  regressions,  the  models  were 
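estimated in STATA, and the Huber-White “sandwich” estimator
was used to adjust for the clustering of students into teachers’
classrooms. The estimated model is

$$\log\left(\frac{p}{1-p}\right) = \alpha + \beta_1\,\text{Pathway} + \beta_2\,\text{Controls}, \tag{2}$$

where $p$ is the probability that a student passes the CAHSEE.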
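
As a reading aid for the logistic model, the sketch below shows how a coefficient on the log-odds scale translates into an odds ratio and into passing probabilities. The intercept and Pathway coefficient used here are hypothetical values chosen only for illustration; they are not estimates from this study, and the control variables are omitted for simplicity.

```python
import math


def inv_logit(log_odds: float) -> float:
    """Convert log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-log_odds))


def odds(probability: float) -> float:
    """Convert a probability to odds."""
    return probability / (1.0 - probability)


# Hypothetical coefficients, for illustration only (not study estimates).
alpha = 1.0          # intercept on the log-odds scale
beta_pathway = 0.5   # Pathway coefficient on the log-odds scale

p_control = inv_logit(alpha)                  # Pathway = 0
p_pathway = inv_logit(alpha + beta_pathway)   # Pathway = 1

# The ratio of the two odds equals exp(beta_pathway), whatever alpha is.
odds_ratio = odds(p_pathway) / odds(p_control)

print(round(p_control, 3), round(p_pathway, 3), round(odds_ratio, 3))
```

The point of the sketch is that exponentiating the Pathway coefficient gives the odds ratio directly, which is the form in which logistic regression results are commonly reported.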

Results 

Results  for  the  AWA 

Table  4  shows  Year  1  pre-  and  posttest  scores  on  the  AWA  for 
Pathway  and  control  students.  These  are  measured  on  the  AWA 
scale  (from  2  to  12).  Pathway  students  gained  0.99  points  more 
than  control  students,  which  was  highly  statistically  significant. 
Significant  effects  were  attained  for  all  grade  levels  except  12th 
grade. Program effects were relatively similar for males and
females, although slightly larger for females. Across race/ethnic
groups,  the  largest  positive  effects  are  for  Hispanics  and  Blacks 
(1.16  and  1.19,  respectively).  Those  for  Whites  and  Asians,  while 
also  statistically  significant,  are  about  half  those  magnitudes  (.58 
and  .51,  respectively).  Program  effects  are  positive  and  significant 
for  all  the  language  groups,  with  the  very  largest  occurring  for 
ELs.  At  a  magnitude  of  1.53,  this  effect  equals  the  pretest 
standard  deviation  for  EL  control  students.  These  large  gains  by 
ELs  suggest  that  Pathway  Project  instructional  strategies  may 
be  particularly  beneficial  for  students  still  in  the  process  of 
learning  English.  The  program  also  produced  significant  gains 
for both FRPL-eligible students and those who were not
eligible. The effects were larger for the lower income
(FRPL-eligible) students.
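
The difference-in-gains figures discussed above can be reproduced from the group means in Table 4. The sketch below is illustrative only; the function names are hypothetical. It also divides the EL difference by the EL control pretest standard deviation (1.50), which is how the text describes the size of that effect.

```python
def gain(pre: float, post: float) -> float:
    """Pretest-to-posttest gain for one group."""
    return post - pre


def difference_in_gains(control_pre: float, control_post: float,
                        pathway_pre: float, pathway_post: float) -> float:
    """Pathway gain minus control gain (a difference-in-differences)."""
    return gain(pathway_pre, pathway_post) - gain(control_pre, control_post)


# All students, Year 1 (Table 4): control 5.69 -> 5.88, Pathway 5.72 -> 6.90.
all_students = difference_in_gains(5.69, 5.88, 5.72, 6.90)

# English learners, Year 1 (Table 4): control 4.84 -> 4.53, Pathway 4.32 -> 5.54.
english_learners = difference_in_gains(4.84, 4.53, 4.32, 5.54)

# Scale the EL difference by the EL control pretest SD (1.50 in Table 4).
el_effect_in_sd_units = english_learners / 1.50

print(round(all_students, 2), round(english_learners, 2),
      round(el_effect_in_sd_units, 2))
```

The results are 0.99 points for all students and 1.53 points for ELs, the latter roughly one pretest standard deviation for EL control students.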

Table  5  repeats  these  calculations  for  the  second  year  of  the 
program.  (To  be  clear,  Pathway  students  in  this  sample  are  in  their 
first  year  experiencing  the  program,  but  Pathway  teachers  are  in 
their  second  year  with  the  program.  Control  students  and  teachers 
have  never  experienced  the  program.)  Once  again,  there  is  a  large, 
positive,  significant  effect  of  the  program  for  the  full  sample.  As 
during  the  first  year,  positive  and  significant  effects  occur  for 
males  and  females,  and  for  Whites,  Hispanics,  and  Asians.  As 
before,  the  largest  gains  are  for  Hispanics,  and  at  1.47,  they  are 
larger  than  the  first-year  gains  for  this  group.  Significant  positive 
effects  are  found  for  all  four  language  groups,  with  the  largest 
occurring for RFEPs (d = 0.82, p < .001). Also as before,
significant  effects  are  observed  both  for  students  who  are  and  are 
not  eligible  for  FRPL,  and  the  larger  effects  occur  for  those  who 
are  eligible  for  FRPL. 

Table  6  shows  these  program  effects  after  regression  adjustment 
for  the  control  variables.  For  each  program  year  the  first  model 
Table 4

Academic Writing Assessment (AWA) Mean Scores for Year 1

                    Control                                          Pathway                                          Difference
Groups              n     Pre          Post         Gain             n     Pre          Post         Gain             in gains    d
All students        851   5.69 (1.92)  5.88 (2.08)  .19** (1.91)     966   5.72 (2.05)  6.90 (2.18)  1.18*** (1.91)   .99***      .52
7th Grade           164   4.88 (1.75)  5.45 (1.83)  .56** (1.79)     167   4.46 (1.20)  5.54 (1.85)  1.08*** (1.88)   .52*        .30
8th Grade           170   5.01 (1.54)  5.51 (1.88)  .50*** (1.82)    211   5.25 (1.80)  6.89 (2.28)  1.64*** (1.84)   1.14***     .74
9th Grade           130   5.55 (1.69)  5.36 (1.82)  -.18 (2.03)      153   5.83 (2.34)  6.75 (2.14)  .92*** (1.73)    1.10***     .66
10th Grade          151   6.36 (1.76)  6.17 (2.00)  -.19 (1.82)      172   6.08 (1.78)  7.60 (1.87)  1.53*** (1.65)   1.72***     .98
11th Grade          128   6.32 (2.42)  6.54 (2.24)  .22 (1.86)       158   6.57 (2.33)  7.74 (2.21)  1.17*** (2.00)   .95***      .39
12th Grade          108   6.46 (1.60)  6.58 (2.45)  .12 (2.12)       105   6.65 (1.92)  6.88 (1.82)  .23 (2.19)       .11         n.s.
Male                423   5.40 (1.90)  5.54 (2.09)  .14 (1.95)       448   5.53 (2.12)  6.57 (2.18)  1.05*** (1.83)   .91***      .48
Female              428   5.97 (1.89)  6.22 (2.01)  .25** (1.87)     518   5.89 (1.98)  7.18 (2.14)  1.30*** (1.97)   1.05***     .56
White               93    5.86 (1.81)  6.25 (1.86)  .39* (1.86)      90    5.73 (1.86)  6.70 (1.97)  .97*** (1.72)    .58*        .32
Hispanic            573   5.52 (1.95)  5.58 (2.03)  .06 (1.92)       645   5.34 (1.84)  6.56 (2.10)  1.22*** (1.95)   1.16***     .59
Asian               148   6.24 (1.82)  6.82 (2.19)  .59*** (1.88)    178   7.21 (2.30)  8.31 (2.03)  1.10*** (1.82)   .51*        .28
Black               24    5.75 (1.59)  5.96 (1.60)  .21 (1.84)       30    5.33 (1.94)  6.73 (2.42)  1.40** (2.24)    1.19*       .75
Native American     13    5.08 (1.50)  5.38 (1.80)  .31 (1.97)       22    5.18 (1.18)  6.00 (1.63)  .82* (1.68)      .42         n.s.
ELL                 159   4.84 (1.50)  4.53 (1.50)  -.31* (1.71)     168   4.32 (1.23)  5.54 (1.75)  1.22*** (1.63)   1.53***     1.02
RFEP                408   5.74 (1.96)  6.11 (2.09)  .36*** (1.82)    466   6.02 (2.04)  7.14 (2.08)  1.12*** (1.98)   .76***      .39
IFEP                57    6.56 (2.07)  7.19 (2.35)  .63* (2.01)      88    6.73 (2.35)  7.98 (2.20)  1.25*** (2.04)   .62†        .30
EO                  227   5.96 (1.86)  6.10 (1.89)  .13 (2.12)       244   5.75 (1.95)  6.98 (2.21)  1.23*** (1.92)   1.10***     .59
FRPL eligible       597   5.56 (1.98)  5.68 (2.10)  .13 (1.91)       681   5.42 (1.90)  6.63 (2.09)  1.21*** (1.91)   1.08***     .55
FRPL not eligible   254   6.00 (1.71)  6.35 (1.95)  .35** (1.93)     285   6.43 (2.22)  7.54 (2.26)  1.11*** (1.90)   .75***      .44

Note. Standard deviations are in parentheses. ELL = English Language Learner; RFEP = Reclassified Fluent English Proficient; IFEP = Initially Fluent English Proficient; EO =
English Only; FRPL = eligible for Free and Reduced Price Lunch; n.s. = Not Significant.
* p < .05. ** p < .01. *** p < .001.

shows the Pathway variable alone, the second adds the control
variables to the equation, and the third and fourth models add
Pathway interactions with race/ethnicity and language groups,
respectively. The second models for each year show that, with
control variables in the equation, the Pathway effects continue to
be positive, significant, and relatively unchanged in magnitude
from the first models. For the first implementation year, the effect
is 0.96 on the AWA scale, which equals an effect size of .48. For

Table 5

Academic Writing Assessment (AWA) Mean Scores for Year 2

                    Control                                          Pathway                                          Difference
Groups              n     Pre          Post         Gain             n     Pre          Post         Gain             in gains    d
All students        616   5.11 (1.92)  5.47 (2.04)  .36*** (1.93)    634   5.48 (2.38)  7.06 (2.23)  1.58*** (2.00)   1.22***     .64
7th Grade           171   4.15 (1.43)  5.18 (1.98)  1.01*** (1.83)   152   4.16 (1.60)  6.77 (1.75)  2.61*** (1.93)   1.61***     1.13
8th Grade           138   4.64 (1.71)  4.70 (1.54)  .07 (1.93)       124   5.10 (2.49)  6.42 (2.37)  1.32*** (1.95)   1.26***     .74
9th Grade           59    5.08 (1.91)  5.37 (2.12)  .29 (1.85)       94    5.86 (2.23)  6.64 (1.93)  .78*** (1.87)    .49         n.s.
10th Grade          87    5.03 (1.74)  5.54 (1.86)  .51* (1.81)      109   5.90 (2.78)  7.37 (2.43)  1.47*** (1.88)   .96**       .55
11th Grade          117   6.35 (1.64)  6.15 (2.18)  -.20 (2.10)      85    6.42 (2.16)  7.88 (2.25)  1.46*** (1.73)   1.66***     1.01
12th Grade          44    7.18 (1.98)  7.23 (1.93)  .05 (1.49)       70    6.71 (1.89)  7.94 (2.30)  1.23*** (2.04)   1.18**      .60
Male                318   4.73 (1.84)  5.07 (1.89)  .34** (1.97)     308   5.04 (2.30)  6.70 (2.23)  1.66*** (2.07)   1.22***     .72
Female              298   5.51 (1.92)  5.89 (2.11)  .38*** (1.89)    326   5.90 (2.39)  7.40 (2.17)  1.51*** (1.92)   1.13***     .59
White               69    4.90 (1.70)  5.71 (1.90)  .80*** (1.90)    66    5.64 (2.23)  7.23 (2.37)  1.59 (2.23)      .79*        .46
Hispanic            402   5.03 (1.95)  5.30 (2.09)  .27** (1.90)     384   4.98 (2.09)  6.72 (2.13)  1.74*** (1.86)   1.47***     .75
Asian               99    5.90 (1.84)  6.05 (1.98)  .15 (1.91)       120   7.37 (2.58)  8.49 (1.94)  1.13*** (2.17)   .98***      .53
Black               19    4.79 (1.55)  5.68 (2.06)  .89 (2.40)       18    4.89 (2.00)  6.11 (2.11)  1.22* (1.80)     .33         n.s.
Native American     27    4.15 (1.54)  5.11 (1.45)  .96* (1.91)      46    4.70 (1.94)  6.30 (1.98)  1.61*** (2.20)   .65         n.s.
ELL                 114   4.23 (1.63)  4.40 (1.52)  .18 (1.58)       119   4.20 (1.42)  5.55 (1.56)  1.34*** (1.60)   1.16***     .71
RFEP                305   5.42 (1.87)  5.65 (2.12)  .23* (1.92)      295   5.62 (2.36)  7.38 (2.12)  1.76*** (2.06)   1.53***     .82
IFEP                22    5.91 (2.04)  5.64 (1.92)  -.27 (2.19)      42    7.43 (2.87)  8.43 (2.40)  1.00*** (1.75)   1.27**      .62
EO                  175   5.04 (1.97)  5.82 (2.00)  .78*** (2.07)    178   5.64 (2.39)  7.24 (2.26)  1.60*** (2.15)   .82***      .42
FRPL eligible       471   5.02 (1.94)  5.31 (2.08)  .29** (1.92)     424   4.92 (2.03)  6.61 (2.08)  1.70*** (1.92)   1.41***     .73
FRPL not eligible   145   5.40 (1.81)  5.98 (1.82)  .58*** (1.97)    210   6.62 (2.62)  7.97 (2.24)  1.35*** (2.13)               .43

Note. The growth of 1 point on the AWA 12-point scale is the equivalent to half a letter grade (from a C to a B-, for example). Standard deviations are
in parentheses. ELL = English Language Learner; RFEP = Reclassified Fluent English Proficient; IFEP = Initially Fluent English Proficient; EO =
English Only; FRPL = eligible for Free and Reduced Price Lunch; n.s. = Not Significant.
* p < .05. ** p < .01. *** p < .001.
Table 6

Effects of Pathway on AWA Gains in Year 1 and Year 2 With Interactions by Race and EL Status

Year 1
Groups                     (1)              (2)              (3)              (4)
Pathway                    .99*** (.22)     .96*** (.20)     .58 (.31)        1.10*** (.28)
8th Grade                                   .23 (.34)        .22 (.34)        .23 (.34)
9th Grade                                   -.46 (.36)       -.44 (.36)       -.41 (.36)
10th Grade                                  -.15 (.29)       -.15 (.30)       -.12 (.29)
11th Grade                                  -.15 (.35)       -.15 (.36)       -.12 (.35)
12th Grade                                  -.69 (.41)       -.69 (.42)       -.68 (.41)
Male                                        -.17 (.10)       -.17 (.10)       -.17 (.10)
Hispanic                                    .03 (.17)        -.25 (.25)       .04 (.17)
Asian                                       .13 (.22)        .18 (.32)        .14 (.21)
Black                                       .20 (.31)        -.08 (.42)       .21 (.31)
Native American                             -.31 (.42)       .08 (.57)        -.30 (.42)
ELL                                         -.16 (.18)       -.14 (.18)       -.32 (.26)
RFEP                                        .11 (.14)        .11 (.14)        .28 (.17)
IFEP                                        .24 (.22)        .29 (.22)        .57 (.30)
FRPL                                        -.01 (.13)       -.01 (.13)       -.02 (.13)
Pathway × Hispanic                                           .56 (.29)
Pathway × Asian                                              -.07 (.40)
Pathway × Black                                              .58 (.62)
Pathway × Native American                                    -.60 (.76)
Pathway × ELL                                                                 .32 (.31)
Pathway × RFEP                                                                -.32 (.23)
Pathway × IFEP                                                                -.56 (.39)
Constant                   .19 (.15)        .36 (.29)        .54 (.34)        .27 (.32)
n                          1,817            1,817            1,817            1,817
R2                         .062             .090             .095             .094

Year 2
Groups                     (A)              (B)              (C)              (D)
Pathway                    1.22*** (.26)    1.30*** (.20)    .87* (.38)       .85** (.30)
8th Grade                                   -1.10*** (.32)   -1.10*** (.31)   -1.10*** (.31)
9th Grade                                   -1.32*** (.36)   -1.31*** (.35)   -1.33*** (.36)
10th Grade                                  -.83* (.33)      -.82* (.32)      -.85* (.32)
11th Grade                                  -1.21** (.41)    -1.19** (.40)    -1.21** (.40)
12th Grade                                  -1.21*** (.31)   -1.22*** (.31)   -1.22*** (.31)
Male                                        .05 (.10)        .06 (.10)        .06 (.10)
Hispanic                                    -.13 (.19)       -.48 (.29)       -.15 (.19)
Asian                                       -.50* (.25)      -.64 (.40)       -.52* (.25)
Black                                       -.09 (.38)       .21 (.52)        -.11 (.39)
Native American                             -.12 (.32)       .33 (.44)        -.11 (.32)
ELL                                         -.43* (.18)      -.42* (.18)      -.60* (.29)
RFEP                                        -.11 (.18)       -.11 (.18)       -.47 (.25)
IFEP                                        -.59* (.23)      -.60* (.24)      -1.00** (.34)
FRPL                                        .06 (.16)        .05 (.16)        .05 (.16)
Pathway × Hispanic                                           .65 (.37)
Pathway × Asian                                              .26 (.46)
Pathway × Black                                              -.63 (.79)
Pathway × Native American                                    -.73 (.59)
Pathway × ELL                                                                 .34 (.38)
Pathway × RFEP                                                                .74* (.33)
Pathway × IFEP                                                                .73 (.47)
Constant                   .36 (.20)        1.44** (.42)     1.65*** (.46)    1.68*** (.44)
n                          1,250            1,250            1,250            1,250
R2                         .089             .165             .171             .171

Note. Standard errors are in parentheses. These are adjusted for clustering using the Huber-White “sandwich” estimator in STATA. ELL = English
Language Learner; RFEP = Reclassified Fluent English Proficient; IFEP = Initially Fluent English Proficient; FRPL = eligible for Free and Reduced Price
Lunch.
* p < .05. ** p < .01. *** p < .001.


the  second  year,  the  effect  is  1.30  on  the  AWA  scale,  which  equals 
an  effect  size  of  .60. 

The third models show that with these controls, significant,
positive program effects are found for Whites in Year 2 (the base
category for race/ethnicity), with a larger, marginally significant
program effect for Hispanics compared with Whites in both Year
1 and Year 2. This is one of the most important findings in the
study: the Pathway Project not only produces significant writing
gains for White students, but it also produces larger effects for
Hispanics. The final model for each year shows that the program
produced significant writing gains for EO students in both program
years, as well as larger (and significantly different from EO)
effects for RFEPs in the second year of the program.

Results  for  the  CAHSEE 

The  CAHSEE  is  first  administered  to  California  students  in  10th 
grade,  and  to  avoid  selection  bias  involving  students  who  do  not 
pass  it  this  first  time  and  take  it  again  at  a  later  grade,  we  restrict 
the  analysis  to  Pathway  and  control  students  in  the  10th  grade. 
Table  7  shows  these  results  for  the  first  year  of  the  program;  Table 
8  shows  results  for  the  second  year. 

Table  7  shows  that  in  the  first  program  year,  75.2%  of  controls 
and  83.7%  of  Pathway  students  passed  the  test,  for  an  overall 
significant,  positive  program  effect  of  8.5  percentage  points.  This 
effect  is  large  and  significant  for  males  (but  not  for  females),  for 
Hispanics, and for EO students. (We choose to ignore the finding
for Other Race students because the sample size was only 19
students.) The program effect is also large and significant for
FRPL, but not for non-FRPL students. These first-year results
(positive program effects for Hispanics, EO, and FRPL students)
are consistent with the first-year results for AWA outcomes.
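
The Year 1 pass-rate comparison can be recomputed from the counts in Table 7. The sketch below is illustrative only and its function names are hypothetical. It also derives an unadjusted odds ratio from the same counts; that figure is a simple descriptive calculation added here for illustration, not an estimate reported by the study, and it ignores the control variables and the clustering adjustment.

```python
def pass_rate(passed: int, n: int) -> float:
    """Percentage of students who passed."""
    return 100.0 * passed / n


def unadjusted_odds_ratio(passed_a: int, n_a: int, passed_b: int, n_b: int) -> float:
    """Odds of passing in group A divided by odds of passing in group B."""
    odds_a = passed_a / (n_a - passed_a)
    odds_b = passed_b / (n_b - passed_b)
    return odds_a / odds_b


# Year 1, all 10th graders (Table 7): control 197 of 262, Pathway 262 of 313.
control_rate = pass_rate(197, 262)
pathway_rate = pass_rate(262, 313)
difference_in_points = pathway_rate - control_rate

# Illustrative, unadjusted odds ratio (Pathway relative to control).
odds_ratio = unadjusted_odds_ratio(262, 313, 197, 262)

print(round(control_rate, 2), round(pathway_rate, 2),
      round(difference_in_points, 2), round(odds_ratio, 2))
```

The pass rates come out at about 75.19% and 83.71%, a difference of about 8.52 percentage points, matching the first row of Table 7.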

Table  8  repeats  these  analyses  for  the  second  year  of  the  program. 
This  shows  a  very  large  effect  on  the  CAHSEE.  Fully  87.7%  of 
Pathway  10th  graders  passed  the  exam,  compared  with  only  69.3%  of 
the  controls,  yielding  a  program  effect  of  18.4  percentage  points,  more 
than  twice  that  achieved  in  Year  1.  Positive  program  effects  are 
observed  for  most  of  the  population  subgroups,  although  small  sample 
sizes  prevent  many  of  them  from  achieving  statistical  significance. 

Table  9  uses  multivariate  logistic  regression  to  adjust  for  control 
variables  (Models  2  and  B)  and  also  test  for  interactions  (Models  3,  4, 
C,  and  D).  The  table  shows  odds  ratio  coefficients,  with  t  statistics  in 
parentheses.  We  also  adjusted  the  standard  errors  for  clustering  using 
the Huber-White “sandwich” estimator. Unfortunately, the result is
that the estimated program effects and many of their interactions with
race and language now fail to achieve statistical significance.
However, the direction and magnitude of the effects are supportive of the
findings  from  Tables  6  and  7.  We  conclude  that  the  evidence  suggests, 
but  does  not  definitely  demonstrate,  that  the  Pathway  Project,  in 
addition  to  raising  scores  on  the  AWA,  also  increased  CAHSEE 
passing  rates,  particularly  for  Hispanics  and  Asians. 

Figures  4  and  5  compare  the  Year  1  and  2  CAHSEE  passing  rates 
for  Pathway  and  control  groups  with  the  statewide  rates  for  Hispanics 
Table 7

California High School Exit Exam (CAHSEE) Pass Rates in Year 1

                Control                Pathway                Difference
Groups          n     Pass   Percent   n     Pass   Percent   in %
All 10th grade  262   197    75.19     313   262    83.71     8.52**
Female          128   106    82.81     162   138    85.19     2.37
Male            134   91     67.91     151   124    82.12     14.21**
Hispanic        180   127    70.56     211   168    79.62     9.07*
White           34    31     91.18     29    24     82.76     -8.42
Asian           38    34     89.47     64    62     96.88     7.41
Other           10    5      50.00     9     8      88.89     38.89
ELL             71    30     42.25     62    32     51.61     9.36
RFEP            105   97     92.38     149   137    91.95     -.43
IFEP            12    12     100.0     30    28     93.33     -6.66
EO              74    58     78.38     72    65     90.28     11.90
Non-FRPL        64    54     84.38     106   93     87.74     3.36
FRPL            198   143    72.22     207   169    81.64     9.42*

Note. Other = Blacks and Native Americans. ELL = English Language Learner; RFEP = Reclassified Fluent English Proficient; IFEP = Initially Fluent
English Proficient; EO = English Only; FRPL = eligible for Free and Reduced Price Lunch.
* p < .05. ** p < .01. *** p < .001.


and  ELs.  Particularly  striking  is  that  in  both  program  years,  both 
Hispanic  and  EL  Pathway  students  exceeded  the  statewide  pass  rate 
for  these  student  groups.  This  shows  that  the  program  has  been 
particularly  successful  with  these  groups.  It  is  also  noteworthy  that 
71%  of  AUHSD  10th  graders  in  our  sample  who  took  the  CAHSEE 
receive  FRPL,  whereas  only  59%  of  10th  graders  who  took  the 
CAHSEE  statewide  are  designated  as  FRPL,  suggesting  that  the 
program  remains  effective  for  students  from  a  low  socioeconomic 
background. 

Table 8

California High School Exit Exam (CAHSEE) Pass Rates in Year 2

                Control               Pathway
Groups          n     Pass  Percent   n     Pass  Percent   Difference in %
All 10th grade  114   79    69.30     122   107   87.70     18.40***
Female          47    34    72.34     65    59    90.77     18.42**
Male            67    45    67.16     57    48    84.21     17.05*
Hispanic        84    59    70.24     73    62    84.93     14.69*
White           16    11    68.75     15    13    86.67     17.92
Asian           6     5     83.33     28    27    96.43     13.10
Other           8     4     50.00     6     5     83.33     33.33
ELL             30    6     20.00     19    11    57.89     37.89**
RFEP            37    35    94.59     48    45    93.75     -.84
IFEP            7     6     85.71     16    16    100.0     14.29
EO              40    32    80.00     39    35    89.74     9.74
Non-FRPL        22    17    77.27     48    45    93.75     16.48*
FRPL            92    62    67.39     74    62    83.78     16.39*

Note. Other = Blacks and Native Americans. ELL = English Language
Learner; RFEP = Reclassified Fluent English Proficient; IFEP = Initially
Fluent English Proficient; EO = English Only; FRPL = eligible for Free
and Reduced Price Lunch.
* p < .05. ** p < .01. *** p < .001.

Discussion

By 2020, one in four children enrolled in America’s K-12 public
schools will be Latino (Maxwell, 2012). Latino students
already represent the largest ethnic group in public schools (53.25%)
in California (California Department of Education, 2015). At both the
national and state level, educational outcomes for Latino students lag
behind those of most other ethnic and racial groups, particularly their
White peers. Furthermore, 9.2% of the U.S. student population are
ELs (U.S. Department of Education, Institute of Education Sciences,
National Center for Education Statistics, 2015). California’s
1.6  million  EL  students  attending  K-12  public  schools  represent  25% 
of  the  state’s  student  population  and  one  third  of  all  ELs  in  the  country 
(Payan  &  Nettles,  2008).  But  many  other  states,  such  as  Arizona, 
Texas,  Florida,  New  York,  and  Illinois,  serve  large  and  growing 
numbers  of  ELs.  As  is  evident  from  NAEP  data,  these  students  are 
struggling  to  achieve  proficiency  in  writing.  Given  the  demographic 
trends  and  academic  performance  of  Latinos  in  general,  and  ELs  in 
particular, literacy practitioners and researchers are looking for
interventions to improve academic outcomes for Latinos and ELs in
secondary  schools. 

In  the  conclusion  of  their  meta-analysis  of  writing  instruction, 
Graham  and  Perin  (2007)  point  out  that  there  is  a  “serious  gap”  in 
the  research  literature  pertaining  to  secondary  adolescents  from 
low-income  families,  inner-city  settings,  and/or  with  low  English 
language proficiency. In addition, Fitzgerald and Amendum
(2007), in their meta-analysis of 1988-2003 research studies of
K-12 writing instruction for ELs in the United States, report no
empirical studies of writing instruction in Grades 6 to 12.
Given the dearth of research regarding effective literacy
instruction for Latinos and mainstreamed ELs in secondary school,
we believe our work has the potential to contribute to the scientific
knowledge base regarding strategies that may reduce the achievement
gap in writing between Latinos, ELs, and their peers in
Grades 7 to 12. Our findings highlight the efficacy of implementing
a cognitive strategies approach to reading and writing instruction
that makes visible to students the thinking tools accessed by
experienced  readers  and  writers  during  the  process  of  meaning 
construction.  Additionally,  our  study  results  are  consistent  with 
Taylor,  Pearson,  Peterson,  and  Rodriguez’s  (2005)  findings  on  the 
influence of teachers’ practices that encourage cognitive engagement
at the elementary level and confirm Langer’s (2000) findings
at the secondary level. In particular, our results indicate that
teachers can learn to engage Latinos and mainstreamed ELs in
higher level interpretive reading and analytical writing about texts
through direct strategy instruction, modeling of strategy use, and
creating opportunities for students to practice and apply these
skills through teacher coaching and feedback.

Table 9

Odds Ratio Coefficients From Logistic Regressions for Passing the CAHSEE
for Years 1 and 2 With Race and EL Interactions

Year 1
Covariates            (1)               (2)               (3)               (4)
Pathway               1.70 (.91)        1.39 (.81)        .44 (-1.11)       2.60 (1.50)
Male                                    .77 (-1.06)       .77 (-1.03)       .78 (-.95)
Hispanic                                .62 (-.89)        .36 (-1.64)       .58 (-.96)
Asian                                   1.97 (1.08)       .81 (-.50)        2.00 (1.10)
Other                                   .40 (-1.45)       .12* (-2.28)      .40 (-1.40)
ELL                                     .21*** (-3.22)    .20** (-3.15)     .28 (-1.92)
RFEP                                    2.46** (2.55)     2.37* (2.36)      4.01*** (3.68)
IFEP                                    3.04 (1.16)       2.77 (1.04)       --
FRPL                                    .78 (-.77)        .79 (-.71)        .79 (-.70)
Pathway × Hispanic                                        3.05 (1.60)       --
Pathway × Asian                                           6.97* (2.36)      --
Pathway × Other                                           18.52* (2.37)     --
Pathway × ELL                                                               .534 (-.98)
Pathway × RFEP                                                              .361 (-1.92)
Pathway × IFEP                                                              --
Constant              3.03** (2.40)     7.69*** (3.98)    13.50*** (4.20)   6.08*** (3.06)
Observations          575               575               575               533

Year 2
Covariates            (A)               (B)               (C)               (D)
Pathway               3.16 (1.69)       2.19 (.96)        1.11 (.10)        1.47 (.39)
Male                                    .53 (-1.92)       .53 (-1.83)       .54 (-1.83)
Hispanic                                .99 (-.03)        .84 (-.48)        .90 (-.29)
Asian                                   2.02 (.63)        .50 (-.69)        1.62 (.47)
Other                                   .22 (-1.92)       .14** (-3.15)     .18* (-2.44)
ELL                                     .07*** (-5.43)    .06*** (-5.83)    .04*** (-4.99)
RFEP                                    2.19* (2.33)      2.05* (2.18)      3.16* (2.37)
IFEP                                    2.63 (.63)        2.20 (.53)        --
FRPL                                    .64 (-.51)        .60 (-.57)        .65 (-.49)
Pathway × Hispanic                                        1.79 (.97)        --
Pathway × Asian                                           12.81 (1.57)      --
Pathway × Other                                           3.13 (1.17)       --
Pathway × ELL                                                               2.88 (1.41)
Pathway × RFEP                                                              .52 (-.89)
Pathway × IFEP                                                              --
Constant              2.26** (2.37)     10.49*** (3.51)   14.74*** (3.74)   13.26*** (3.16)
Observations          236               236               236               220

Note. t statistics in parentheses. Standard errors have been adjusted for
clustering using the Huber-White “sandwich” estimator in STATA. Results for
IFEP students have not been reported in Models 4 and D due to small sample
sizes. Other = Blacks and Native Americans. ELL = English Language
Learner; RFEP = Reclassified Fluent English Proficient; IFEP = Initially
Fluent English Proficient; FRPL = eligible for Free and Reduced Price Lunch.
* p < .05. ** p < .01. *** p < .001.

[Figure 4: bar chart titled “CA High School Exit Exam Pass Rates for Year 1”; legend: Control, Treatment, State.]

Figure 4. California High School Exit Exam pass rates for Hispanic and
English Learner students by treatment condition compared with the state
averages (by percentage points) during Academic Year 2012-2013. The
star indicates statistical significance in pass rates between Pathway and
control groups. Percentages reported come from Table 7. * p < .05. ** p <
.01. *** p < .001.

[Figure 5: bar chart titled “CA High School Exit Exam Pass Rates for Year 2”; categories: Hispanic, EL; legend: Control, Treatment, State.]

Figure 5. California High School Exit Exam pass rates for Hispanic and
English Learner students compared with the state averages (by percentage
points) during Academic Year 2013-2014. Stars indicate statistical
significance in pass rates between Pathway and control groups. Percentages
reported come from Table 8. * p < .05. ** p < .01. *** p < .001.

Our main finding is that our randomized controlled trial of the
Pathway Project in the AUHSD largely replicated the earlier RCT
in the SAUSD (SAUSD: Year 1, d = .35, and Year 2, d = .67;
AUHSD: Year 1, d = .48, and Year 2, d = .60). Further, the
demographic diversity of the AUHSD enabled us to test whether
explicitly training teachers to teach their students strategies to read
and write about complex texts (including planning and goal setting,
tapping prior knowledge, making connections, forming interpretations,
monitoring, revising meaning, reflecting and relating, and
evaluating) could help to close the achievement gap in writing for
Latinos in Grades 7 to 12 as well as improve educational outcomes
for ELs. Our impact estimates suggest large, positive effects on the
AWA, which is the on-demand writing assessment most closely
tied to the Pathway Project, in both Year 1 and Year 2 of the
program, as well as positive effects on the CAHSEE pass rates in
both years of the program. However, due to particularly small
sample sizes for the regression analyses of the CAHSEE, the
Pathway effect failed to attain statistical significance in these
regressions. We thus consider the CAHSEE effects to be suggestive
rather than definitive. To place our results in a broader
research context, it is useful to compare the magnitude of the
Pathway effect to results from Graham and Perin’s (2007)
comprehensive meta-analysis of writing instruction for adolescent
students. The magnitude of the Pathway effect (d = .48 in Year 1 to
d = .60 in Year 2) is higher than the mean weighted effect size for
process writing (d = .32), and similar to that for other writing
interventions that include professional development for teachers (d = .46).
However, the magnitude of the Pathway effect on writing (d = .48
in Year 1 to d = .60 in Year 2) is smaller than the mean weighted
effect size for strategy instruction (d = .93). It is unclear precisely
why the Pathway effect is smaller, on average, than the effect size
for strategy instruction. The complex nature of language development
and its effect on strategy use and writing development
require further exploration. Students’ ability to benefit from strategy
instruction might vary as a function of English proficiency.
Additionally, several of the strategy interventions included in the
meta-analysis were delivered by research staff, whereas the current
intervention was delivered by classroom teachers, and this might
have affected the differential outcome.

The AWA means that we report (see Tables 3 and 4) show
that although White, Hispanic, and Asian Pathway students all
had statistically significant gains from pretest to posttest, Pathway
Hispanics demonstrated larger gains from pretest to posttest
and higher average posttest scores than their White peers in
the control group, and they were less than one quarter point
behind their White Pathway peers in Year 1. In Year 2, Pathway
White and Hispanic students again had statistically significant
gains from pretest to posttest. Pathway Hispanics again showed
the largest gains. Further, they had higher average posttest
scores than White control students and were on average just half
a point (on the AWA) behind their White Pathway peers. These
results indicate a substantial reduction in the achievement gap
between Latino students and their White peers. Regression
analyses confirm the significant gains made by Hispanic students;
after controlling for other demographic variables, there
was still a marginally significant interaction for AWA gains
made by Hispanic students in Pathway in Year 1 and Year 2
(see Table 6).

Similarly, although ELs, RFEP learners, and EO students all
gained significantly from pretest to posttest on the AWA, ELs
had the greatest pre/post gains of any group in Year 1 and made
greater gains than their EO peers in Year 2. Further, their
posttest scores were one quarter of a point behind their EO
control peers, again demonstrating a further closing of the
achievement gap (see Tables 3 and 4). Notably, in Year 2, there
was a significant, positive interaction for RFEP students in
Pathway compared with EO speakers (our reference group),
which highlights the efficacy of Pathway Project strategies for
students whose native language may not be English. Our findings
for the CAHSEE generally support these results (Tables 7
and 8), but small sample sizes in the regressions (see Table 9)
suggest that these results should be taken as suggestive rather
than definitive.

What might account for the increase in the AWA effect size
from Year 1 to Year 2 and the higher CAHSEE pass rates of
Pathway students compared with control students between Year
1 and Year 2? Pressley (2002) has observed that it takes time for
teachers to internalize and implement cognitive strategies
interventions with confidence and competence. At the end of Year
2, teachers were asked to describe which strategies they felt
were most effective in increasing the literacy skills (both reading
and writing) of their students. The most highly reported
“effective strategy” was the color coding of essays, both students’
own papers and model or anchor papers (24 teachers
reported). The next most effective strategies were using the
cognitive strategies tool kit and teaching the introductory and
revision tutorials (21 teachers reported each of these as effective).
Based on teacher reports such as these and classroom
observations that showed notable differences in Pathway and
control teachers’ instruction at the end of Year 2, we speculate
not only that teachers were more comfortable with implementing
the Pathway Project in Year 2 but also that their recognition of the
benefit that their students received in Year 1 motivated them to
integrate Pathway strategies into their curriculum with greater
fidelity and intensity.

Our study had several limitations. One is that, because of
differential teacher attrition across schools, which was correlated
with differences in the student demographics of these
schools, demographic imbalances between Pathway and control
group students occurred in the second year of the study. (However,
these demographics were controlled in regression analyses.)
A second limitation is that despite the fact that the
AUHSD is more ethnically and linguistically diverse than the
SAUSD, only 12% of the students are White and only 23% are
EO. Our pre-post writing assessment of students in Grades 7 to
12 had a sample size that was large enough to make comparisons
among race/ethnicity and language subgroups. However,
in our 10th grade sample of the CAHSEE pass rates, some
numbers dropped below a level that was stable enough to make
such comparisons. In a current study involving over 8,000
students from four large, urban school districts in the Southern
California region, we hope to compare the impact of the Pathway
Project on an even more diverse student population. Our
research indicates that cognitive strategies instruction has a
salutary effect on Latinos and ELs as evidenced by their improved
AWA scores and CAHSEE pass rates. However, other
factors, including academic vocabulary, grammar, and a range
of discourse features, may also merit instruction in English
language arts classes. In addition, many learner characteristics
might also affect students’ development and use of strategies,
including the learners’ first- and second-language proficiency,
background knowledge, experience, previous education, and
gender. Further research is needed to address the limitations of
the current study and to replicate these promising findings in
other school districts.

Conclusion 

Throughout the nation, literacy practitioners and researchers
are looking for effective practices to help Latinos and ELs in
secondary school improve their academic writing. The robust
findings from this study yield promising results that appear to
close the achievement gap in writing for Latinos and help ELs
to gain ground by narrowing the achievement gap between ELs
and their native English speaking peers. Further, these findings
highlight the importance of sustained ongoing professional
development for teachers if they are to effectively teach academic
reading and writing to Latinos and ELs.

Appendix A

Pretest Written by an Eighth Grade Hispanic Female, Reclassified English Learner


Received  a  5  (3+2)  -  combined  score  of  two  readers 
Student  Paper 
Student  Code:  1C124 

The  theme  in  the  story  “The  Medicine  Bag,”  I  think  that  the 
main  theme  of  this  story  is  tradition.  First  of  all,  Martin,  the 
grandson  wasn’t  as  surprise  to  see  grampa  comeing  in  his  way.  He 
was  affraid  his  friends  will  juge  him  since  hes  grampa  didnt  look 
as an indian but was. He has took his friends all about grampa’s
stories about animals and people, which made them want to met an
indian. As grampa staded in their home or a couple of weeks he
dicided to bring his friends over to meet grampa. Martin was
happy  because  they  all  loved  him.  Days  passed  and  grampa  has 
told  that  he  has  to  give  him  the  “Medicine  Bag.”  He  knew  that 
he  had  to  wear  it  but  was  also  kinda  embarresed  wearing  it  to 
school.  But  grampa  talked  to  him  and  gave  him  the  Medicine 
Bag.  That  day  grampa  was  sent  to  the  hospital  and  dieded  days 
later.  This  made  Martin  feel  sad  for  dieding,  but  also  was  happy 
to  have  a  part  of  grampa  gaven  to  him.  In  conclusion,  tradition 
can  bring  lots  of  love  and  memory  to  one  and  traditional  ideams 
is  a  way  to  remember  a  loved  one. 


Appendix  B 

Posttest  Written  by  an  Eighth  Grade  Hispanic  Female,  Reclassified  English  Learner 


Received  a  9  (5+4)  -  combined  score  of  two  readers 
Student  Paper 
Student  Code:  1024 

He’s  all  there!  He’s  all  there!”  quotes  the  narrator  as  sees  his 
disabled  brother  smiling  at  him.  In  the  story,  “The  Scarlet  Ibis,”  by 
James  Hurst,  the  narrator  learns  a  valueable  lesson  to  think  before 
you  do  any  action  or  you  might  regret  it. 

In  the  beginning  of  the  story,  the  narrator  reacts  to  Doodle,  his 
disabled  brother,  violanty  because  he  wants  a  brother  who’s  “all 
there,”  not  a  disabled  one  who’s  “not  there.”  The  narrator  has 
already  made  plans  to  kill  Doodle  by  smothering  him  with  a  pillow 
because Doodle “wasn’t there.” However, his evil idea was
washed away when Doodle looked straight at him and smiled.
“He’s  all  there.  He’s  all  there,”  yelled  the  narrator  with  excitement. 

In  the  middle  of  the  story,  the  narrator  begins  to  have  faith  in 
Doodle  finally  becoming  normal.  He  is  eager  to  teach  Doodle  how 
to  walk.  “I  just  can't  do  it”  quotes  Doodle,  but  the  narrator  doesn’t 
want  him  to  fail.  With  practice  and  practice,  Doodle  finally  learned 
how  to  walk.  As  the  narrator  showed  their  parents  that  Doodle 
could  walk,  he  began  to  cry.  Disappointed  and  sad,  the  narrator 
wanted  Doodle  to  walk  only  because  he  was  ashamed  of  having  a 
crippled  brother.  He  says,  “They  did  not  know  that  I  did  it  for 
myself;  that  pride,  whose  slave  I  was,  spoke  to  me  louder  than  all 
of  their  voices.” 


Towards the end of the story, the family finds a scarlet ibis,
weak  and  sick,  dead  on  the  ground.  This  disappoints  Doodle  which 
makes  him  want  to  bury  the  bird.  After  that,  the  boys  decide  to  go 
to  the  Horsehead  Landing  where  they  got  to  a  skiff  and  floated 
down  the  creek  with  a  tide.  As  darkness  descended,  they  both 
decided  to  go  home.  Tired  and  frightened,  Doodle  began  to  walk 
closer  to  the  narrator  which  got  him  annoyed  and  he  decided  to  run 
off  without  Doodle.  “Brother,  Brother,  dont  leave  me!  Don’t  leave 
me!  The  narrator  waited  for  Doodle  to  catch  up  to  him  but 
nothing  happened.  Finally,  he  went  back  and  found  him  huddled 
beneath  a  red  bush.  His  legs,  bent  sharply  at  the  knees,  had  never 
before seemed so fragile, so thin. The narrator screamed and began
to weep. “Doodle!” he cried shaking him; but their was no answer.
He  layed  their  crying,  sheltering  his  fallen  scarlet  ibis. 

The  Scarlet  Ibis  is  a  symbol  of  the  story  because  Doodle  is  a 
week  and  sick  boy  left  behind  in  an  unknowned  place  and  falls  into 
a  bush  and  dies.  The  Scarlet  Ibis  is  similar  to  Doodle;  it  comes  to 
an  unknowned  place  and  dies.  The  theme  of  this  story  is  to  think 
before you do an action or you might regret it.

Supporting the reading comprehension and content knowledge acquisition of English language learners (ELs)
requires instructional practices that continue beyond developing the foundational skills of reading. In
particular, the challenges ELs face highlight the importance of teaching reading comprehension practices in
the middle grades through content acquisition. We conducted a randomized control trial to examine the
efficacy of a content acquisition and reading comprehension intervention implemented in eighth-grade social
studies classrooms with English language learners. We used a within-teacher design in which 18 eighth-grade
teachers’ social studies classes were randomly assigned to treatment or comparison conditions. Teachers
taught the same instructional content to treatment and comparison classes, but the treatment classes used
instructional practices that included comprehension canopy, essential words, knowledge acquisition, and
team-based learning. Students in the treatment group (n = 845) outperformed students in the comparison
group (n = 784) on measures of content knowledge acquisition and content reading comprehension but not
general reading comprehension. Both ELs and non-ELs who received the treatment outperformed those
assigned to the business-as-usual (BAU) comparison condition on measures of content knowledge acquisition
(ES = 0.40) and content-related reading comprehension (ES = 0.20). In addition, the proportion of English
language learners in classes moderated outcomes for content knowledge acquisition.

Designing and implementing effective instruction for the growing
number of English language learners (ELs) in public schools is
a significant educational challenge. Approximately 20% of students
in the United States are children of immigrant parents (Fix &
Passel, 2003), and although they are not all ELs, many are, with
approximately 10.5% of all U.S. students identified as ELs (National
Clearinghouse for English Language Acquisition, 2006).
The expectation is that within the next 10 to 15 years, as many as
one in four children enrolled in schools in the United States will be
ELs. Although ELs have many assets, such as linguistic and
cultural diversity, they commonly face educational challenges,
including low achievement across reading, writing, history,
mathematics, and other academic areas (Lesaux, Kieffer, Kelley, &
Harris, 2014; Snow & Uccelli, 2009), and are at a disproportionately
high risk for school dropout (Hernandez, 2012; Kennedy &
Monrad, 2007). Not all language-minority students have the same
trajectory for school success. Students who begin kindergarten
with proficiency in English have academic trajectories similar to
non-ELs, whereas students who enter school with limited English
proficiency do not fare as well, demonstrating weaker learning
trajectories that are quite divergent from their non-EL peers by the
end of elementary school (Kieffer, 2008).

Despite  the  growing  number  of  ELs  in  schools  and  the  increased 
attention  to  improving  their  academic  opportunities,  ELs  continue  to 
demonstrate  difficulties  beyond  the  elementary  grades.  In  particular, 
they  demonstrate  difficulty  in  literacy,  with  only  26%  of  ELs  in  the 
eighth  grade  scoring  above  “basic”  on  reading  achievement  tests 
(National  Center  for  Education  Statistics,  2013).  Of  further  concern, 
these  data  have  not  changed  significantly  in  any  state  in  the  previous 
10 years. Scores for ELs are 35 to 40 scale score points below those of
students who are not ELs (National Center for Education Statistics, 2013). A
similar pattern is observed on the fourth-grade reading test from the
National Assessment of Educational Progress.

ELs  frequently  score  lower  on  achievement  tests  in  part  because  of 
their  challenges  in  developing  background  knowledge  and  vocabulary 
in  English  (National  Center  for  Education  Statistics,  2009).  ELs  also 
have  the  dual  task  of  concurrently  learning  English  and  content, 
increasing the likelihood that the rate at which they learn to read and
understand English will influence their content knowledge. Thus, low
scores  in  reading  do  not  bode  favorably  for  middle-school  ELs  in 
content  classrooms,  in  which  reading  for  understanding  is  an  integral 
part  of  success.  As  ELs  move  from  elementary  to  middle  school,  the 
demands for sophisticated language, literacy, and background knowledge
increase, requiring teachers to access effective instructional
practices  that  are  beneficial  to  a  range  of  learners,  including  ELs. 

Lack  of  Opportunity 

In  addition  to  the  heightened  challenges  of  learning  content  in  a 
language  they  are  simultaneously  learning  to  read  and  understand, 
ELs  may  have  restricted  opportunities  to  learn  based  on  their  lack  of 
access to high-quality teachers, proficient student learners, and curricula.
For example, Callahan (2005) reported that “tracking” ELs
played  a  significant  role  in  their  learning  achievement.  Her  analysis 
revealed  that  ELs  were  primarily  clustered  in  classes  that  were  not 
college  preparatory.  To  the  extent  that  opportunity  to  learn  content 
and  academic  vocabulary  is  related  to  the  curriculum  demands  of  the 
class,  and  that  teachers  are  more  likely  to  provide  challenging  content 
and  discourse  opportunities  to  students  who  are  proficient  in  English, 
ELs  in  classes  with  significant  numbers  of  non-ELs  may  be  more 
likely  to  access  high-level  academic  vocabulary  and  content  learning. 
The reverse is also likely: teachers of classes with high concentrations
of ELs may provide fewer opportunities for rich language
discourse  and  content  learning. 

Importance  of  Enhancing  Reading  Comprehension  of 
ELs  in  Middle  Grades 

The  previously  discussed  data,  as  well  as  the  findings  from  two 
practice  guides  (Baker  et  al.,  2014;  Francis,  Rivera,  Lesaux,  Kieffer, 
&  Rivera,  2006)  that  recommend  content  area  instruction  as  a  focus 
for  learning  new  concepts  and  knowledge,  underscore  the  importance 
of  teaching  academic  content  and  literacy  to  ELs  in  the  middle  grades. 
The  urgent  need  to  improve  instructional  practices  for  ELs  in  the 
middle  grades  is  demonstrated  in  these  students’  slow  development  of 
reading  comprehension  (Mancilla-Martinez,  Kieffer,  Biancarosa, 
Christodoulou,  &  Snow,  2011).  Further,  substantial  numbers  of  ELs 
demonstrate  “late-emerging”  reading  difficulties,  or  reading  problems 
that  emerge  after  fourth  grade  (Kieffer,  2010).  This  finding  suggests 
that  ELs  can  often  master  the  foundational  skills  of  word  reading  with 
adequate  fluency,  but  that  as  the  syntax,  vocabulary,  and  background 
knowledge  of  texts  become  more  complex,  ELs’  reading  difficulties 
manifest.  At  each  developmental  period,  as  determined  by  grade  level 
(Grades  3,  5,  and  8),  ELs  were  found  to  be  at  substantially  greater  risk 
than  native  English  speakers  for  reading  difficulties  that  were  not 
recognized  prior  to  Grade  3  (Kieffer,  2010).  Thus,  supporting  the 
reading  comprehension  of  ELs  requires  instructional  practices  that 
continue  beyond  developing  the  foundational  skills  of  reading.  In 
particular, the challenges ELs face highlight the importance of teaching
reading comprehension practices in the middle grades.


Purpose  of  the  Study 

Considering  the  high  need  for  effective  instructional  practices 
that enhance both reading comprehension and knowledge acquisition
for ELs, we modified Promoting Adolescents’ Comprehension
of Text (PACT), a previously developed package of instructional
practices, by interweaving features of instruction associated with
improved  outcomes  for  ELs  (e.g.,  additional  focus  on  academic 
vocabulary  and  peer  discourse).  We  selected  PACT  for  several 
reasons.  First,  PACT  has  demonstrated  efficacy  through  previous 
randomized  control  trials  with  eighth-grade  students  and  students 
with  disabilities  (Swanson,  Wanzek,  Vaughn,  Roberts,  &  Fall, 
2015;  Vaughn  et  al.,  2013,  2015;  Wanzek  et  al.,  2015).  In  previous 
studies examining overall effects for all learners, PACT was associated
with improved outcomes in reading comprehension
(Vaughn et al., 2013), content acquisition and vocabulary knowledge
(Vaughn et al., 2013, 2015), and sustained content and
vocabulary knowledge at multiple points through follow-up measures
(Vaughn et al., 2015). Second, research suggests that ELs
and  below-grade-level  readers  exhibit  many  of  the  same  learning 
challenges  in  the  middle  grades  and  that  similar  instruction  may  be 
necessary  for  both  groups  (Lesaux  &  Kieffer,  2010).  Thus,  many 
of  the  instructional  practices  of  PACT  held  promise  for  ELs, 
particularly  if  instructional  enhancements  were  added.  Third,  the 
platform  of  PACT  instructional  practices  is  well  aligned  with  best 
practices  for  teaching  ELs,  leading  us  to  hypothesize  that  with 
appropriate  modifications  (described  in  the  next  section),  PACT 
would  yield  positive  outcomes  for  ELs. 

This  study  represents  a  randomized  control  trial  of  the  efficacy 
of  a  PACT  treatment  modified  for  ELs  with  eighth-grade  students 
in  schools  with  moderate  to  high  concentrations  of  ELs,  ensuring 
that ELs would be included in all participating classes. We hypothesized
that students who were not ELs would perform similarly
to students in previous studies (Vaughn et al., 2013, 2015),
with  the  treatment  students  demonstrating  statistically  significantly 
higher  scores  than  comparison  students  on  content  knowledge 
acquisition  and  content-related  reading  comprehension.  We  also 
hypothesized that ELs in the treatment condition would outperform
ELs in the comparison condition on content acquisition and
content  reading  comprehension.  Thus,  we  hypothesized  that  the 
modified  version  of  PACT  would  have  a  universally  positive  effect 
on  all  learners.  We  further  hypothesized  that  participants  in  the 
treatment condition would not outperform participants in the comparison
condition on the distal measure of reading comprehension.
Finally,  we  acknowledge  the  important  influence  of  classmates  on 
a  given  student’s  outcomes.  The  considerable  literature  addressing 
peer  effects  on  learning  (e.g.,  Angrist  &  Lang,  2004;  Gottfried, 
2014;  Hanushek,  Kain,  Markman,  &  Rivkin,  2003)  has  focused  on 
socioeconomic  status  and  prior  achievement  in  large-scale  extant 
databases.  We  are  unaware  of  studies  that  consider  language- 
related  peer  variables,  certainly  not  in  the  context  of  a  discourse- 
based  intervention  designed  to  improve  reading  comprehension 
and content knowledge. We hypothesized that PACT’s effect
would depend in part on class levels of English academic language
proficiency, which we operationalized as the proportion of ELs (or
non-ELs) in the classroom. We expected that the proportion of ELs
in a given class would moderate PACT’s effect, with increasing
class-level prevalence of ELs corresponding to diminishing treatment
effects for all students, particularly for ELs.

Features of Effective Instructional Practices for ELs

The  set  of  instructional  practices  tested  in  this  randomized 
control  trial  can  be  woven  into  content  area  instruction  (i.e.,  social 
studies)  to  enhance  content  learning  and  comprehension  for  all 
learners, with a specific focus on ELs. Using the PACT instructional
practices as a foundation, we integrated research-based
knowledge  derived  from  multiple  sources,  including  practice 
guides  (Baker  et  al.,  2014;  Francis  et  al.,  2006),  to  enhance  the 
features  of  instruction  and  promote  best  practice  for  teaching  ELs. 
Although many of these practices were already part of the foundation
of PACT, we enhanced the focus on academic vocabulary
by teaching theme-related vocabulary words across time and activity,
integrating oral and written instruction into content learning,
and using both paired learning and team-based learning (TBL;
Michaelsen & Sweet, 2011) to provide peer interaction and extended
practice with feedback (Vaughn et al., 2013, 2015; Wanzek
et al., 2015; see sample lessons in the online supplemental materials).

Reviews of the research on effective instruction for ELs (August
& Shanahan, 2006; Baker et al., 2014; Francis et al., 2006)
emphasize the importance of addressing academic language by
providing direct and systematic instruction of the English language
while teaching content across the disciplines. In the PACT treatment,
informational text reading that included target vocabulary
was  central  to  every  unit  and  anchored  the  instruction  of  academic 
vocabulary.  Essential  words  in  each  unit  were  taught  explicitly  and 
reinforced  by  engaging  students  in  reading,  speaking,  and  writing 
activities,  in  which  students  applied  the  meaning  of  the  words  in 
multiple  and  meaningful  contexts.  Academic  vocabulary  teaching 
was  enhanced  in  the  modified  version  of  PACT  by  providing 
instruction on more abstract terms that students need to communicate
across the disciplines and that are needed for school tests
and tasks, for example, academic expressions for comparing and
contrasting  and  using  cause  and  effect  (Lesaux,  Kieffer,  Faller,  & 
Kelley,  2010). 

In  addition  to  the  TBL  in  the  original  versions  of  PACT,  we 
provided structured opportunities for ELs to participate in academic
discussions and writing that supported the use of learned
content  vocabulary  (August,  Branum-Martin,  Cardenas-Hagan,  & 
Francis,  2009;  Lesaux  et  al.,  2014).  For  example,  in  knowledge 
application  activities,  students  were  taught  and  expected  to  justify 
their  answers  by  using  learned  academic  vocabulary  and  citing 
evidence  from  informational  texts. 

Based on intervention studies with ELs, we incorporated additional
features of instruction associated with improved outcomes
for  ELs  (August  et  al.,  2009;  Vaughn  et  al.,  2009).  Students 
worked  in  pairs  or  small  groups  during  most  PACT  components  to 
prepare for discussing and writing responses to inferential questions
and summaries that focused on building knowledge and
developing academic language. Instruction on new social studies
content was supplemented with brief videos, visuals, and graphic
organizers to provide students the necessary background information
to participate in academic discourse. Finally, one of the
principles of the TBL comprehension checks and knowledge application
activities was continuous, targeted feedback, in which
teachers affirmed or corrected students’ understanding of the content.


Method 

Research  Design 

We  conducted  a  randomized  control  trial  to  test  the  efficacy  of 
a  modified  version  of  the  PACT  reading  comprehension  and 
content  acquisition  intervention  in  eighth-grade  social  studies 
classes.  Participants  included  English-speaking  students,  ELs,  and 
former ELs. To be selected, schools (and their districts) had to
serve high numbers of ELs, and each class had to have at least
one identified EL. In the selected schools, all eighth-grade social
studies teachers participated, and their class sections were randomly
assigned to the treatment or comparison condition. Each
teacher taught both PACT treatment classes and comparison
classes, and the same social studies content was delivered to
students in both conditions, albeit using the interrelated components
of PACT in treatment classes only.

Setting  and  Participants 

School  sites.  The  PACT  study  was  implemented  during  the 
2013-2014  academic  year  across  seven  middle  schools  in  three 
school districts in two distinct areas of the United States. Three of
the schools were in the southwestern United States: two in a
large, diverse urban district, and another in a smaller, predominantly
Hispanic suburban district. Four of the schools were in one
district in the southeastern United States; about 40% of the students
in these schools were Hispanic. The proportion of students
identified  as  ELs  in  the  schools  ranged  from  15.4%  to  44.5%. 
Although  we  recruited  school  districts  that  served  the  highest 
numbers  of  ELs  in  our  surrounding  areas,  districts  ultimately 
dictated  which  schools  could  participate  in  the  study.  Furthermore, 
principals ultimately decided whether to participate, which resulted
in the wide range of EL proportions across schools. We
report  district-identified  EL  classification,  but  the  criteria  that  state 
departments  of  education  use  may  differ.  In  the  five  schools  for 
which  such  data  were  available,  the  proportion  of  students  who 
qualified  for  free  or  reduced-price  meals  ranged  from  48.8%  to 
82.6%. Additional school-level demographic information is displayed
in Table 1.

Teachers.  The  18  teacher  participants  (nine  women  and  nine 
men)  were  eighth-grade  U.S.  history  teachers  who  implemented 
the  intervention  with  researchers’  support  in  treatment  classes  and 
continued  with  typical  instruction  in  comparison  classes.  All  of  the 
teachers  had  a  bachelor’s  degree  and  five  had  a  master’s  degree. 
Their  teaching  experience  ranged  from  less  than  1  year  to  34  years 
(M  =  10.13,  SD  =  10.48).  Teachers’  ethnicities  included  83.3% 
White,  16.7%  Hispanic,  and  5.6%  Asian. 

Students. A total of 1,629 eighth-grade students were assigned
to 94 U.S. history class sections. Classes were randomly
assigned within teacher to 49 treatment (845 students) and 45
comparison (784 students) classes. When teachers had an odd
number of classes, randomization assigned extra classes to treatment.
Of the participants, 26.7% were current ELs or held an EL
status within the last 2 years. Students’ EL designation was determined
in part by district language-proficiency tests. Similar to
other  content  area  studies  with  secondary  ELs  (Snow,  Lawrence, 
&  White,  2009;  Vaughn  et  al.,  2009),  recently  exited  EL  students 
were  included  in  the  EL  sample  because  they  require  continued 

Table 1

School-Level Demographics for All Participants

School descriptives           School A   School B   School C   School D   School E   School F   School G
                              District 1 District 1 District 2 District 3 District 3 District 3 District 3

Gender
  Male                        43.2%      42.0%      43%        45.2%      50.6%      44.9%      52.2%
  Female                      52.7%      50.7%      41.8%      54.2%      47.0%      52.9%      47.8%
  Missing                     4.1%       7.2%       15.2%      .6%        2.4%       2.2%       .0%
Race
  Hispanic                    45.42%     44.72%     85.81%     40.82%     36.9%      38.11%     26.07%
  African American            7.25%      7.32%      8.21%      9.55%      10.85%     11.23%     5.39%
  Caucasian                   33.21%     27.63%     4.48%      38.01%     41.22%     36.56%     60.0%
  American Indian             12.98%     17.07%     .75%       9.53%      8.28%      10.57%     6.74%
  Asian                       .76%       1.63%      .75%       1.12%      2.56%      2.87%      1.57%
  Pacific Islander            .38%       1.63%      .0%        .94%       .99%       .66%       .22%
Special education             14.4%      20.3%      12.0%      8.4%       10.7%      8.3%       6.8%
Home language
  English                     20.5%      52.2%      53.2%      36.8%      43.3%      29.7%      65.4%
  Spanish                     74.0%      39.1%      30.4%      57.1%      47.0%      58.3%      29.3%
  Other                       .7%        1.4%       1.3%       3.2%       5.2%       6.9%       3.1%
  Missing                     4.8%       7.2%       15.2%      2.9%       4.6%       5.1%       2.21%
English language learner      44.5%      33.3%      15.8%      27.4%      25.6%      37.3%      15.4%
Free or reduced-price meals   n/a        n/a        74.1%      82.6%      68.9%      67.8%      48.8%
Participating students
  n                           146        69         158        310        328        276        324
  %                           9%         4.2%       9.7%       19%        20.1%      16.9%      19.9%
Participating teachers
  n                           2          2          2          3          3          3          3
  %                           11.12%     11.12%     11.12%     16.16%     16.16%     16.16%     16.16%

Note. n/a = not available.


academic language support while they face increasingly demanding
academic tasks in mainstream middle-school classrooms. Furthermore,
students who were exited from EL status in middle
school were included in the EL sample in light of the variability and
subjectivity of the criteria used to reclassify ELs as English proficient
(Office of Planning, Evaluation and Policy Development,
Policy & Program Studies Service, 2012). Additionally,
on a student survey administered at pretest, 50.5% of the students
reported that a language other than English (mostly Spanish) was
spoken at home. Additional student-level demographic information
is displayed in Table 2. Note that many students identified
themselves  as  multiracial;  therefore,  the  number  of  participants 
represented  across  racial  categories  adds  up  to  more  than  1,629 
(the  total  number  of  participants). 
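
The within-teacher assignment of class sections described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the procedure the research team actually ran: the function name, the seed, and the section labels are hypothetical. The only rule taken from the text is that a teacher's extra section goes to treatment when the number of sections is odd.

```python
import random


def assign_within_teacher(sections_by_teacher, seed=0):
    """Randomly split each teacher's sections into treatment and comparison.

    Hypothetical helper: when a teacher has an odd number of sections,
    the extra section goes to treatment, as the text describes.
    """
    rng = random.Random(seed)  # fixed seed only so the sketch is repeatable
    assignment = {}
    for teacher, sections in sections_by_teacher.items():
        shuffled = list(sections)
        rng.shuffle(shuffled)
        n_treatment = (len(shuffled) + 1) // 2  # rounds up for odd counts
        for index, section in enumerate(shuffled):
            condition = "treatment" if index < n_treatment else "comparison"
            assignment[section] = condition
    return assignment


# Made-up section labels for two teachers (not study data).
example = {
    "teacher_1": ["t1_p1", "t1_p2", "t1_p3", "t1_p4", "t1_p5"],
    "teacher_2": ["t2_p1", "t2_p2", "t2_p3", "t2_p4"],
}
result = assign_within_teacher(example)
```

Because every teacher contributes sections to both conditions, teacher-level differences are balanced across conditions by design.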

Differential attrition. Of the 1,629 participants, 224 did not
have a posttest score on the Assessment of Social Studies Knowledge
(ASK; all student measures are described in detail later in the
article), yielding an overall attrition rate of 14.1% and a differential
attrition rate of 1.6%. On the Modified Assessment of Social
Studies Knowledge and Reading Comprehension (MASK), 305
students did not have a posttest score, yielding an overall attrition
rate of 19.4% and a differential attrition rate of .2%. For the
Gates-MacGinitie Reading Comprehension Subtest (4th ed.;
MacGinitie, MacGinitie, Maria, Dreyer, & Hughes, 2006), 327
students did not have a posttest score, yielding an overall attrition
rate of 20.1% and a differential attrition rate of .4%. To establish
whether differential attrition was evident across the groups, a
two-way analysis of variance was conducted on the primary outcome
variables. The factors in the analysis were treatment condition,
completer status at posttest, and the interaction of condition
and completer status. A significant interaction signifies systematic
group differences in the characteristics of students who remained
in the study. Data revealed no significant condition by completer
status interaction effect (p values ranged from .13 to .61). These
findings  indicate  that  attrition  among  groups  was  unlikely  to  bias 
the  observed  effects  of  the  intervention. 
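
As a rough illustration of the two rates reported above, the sketch below computes overall attrition as the share of assigned students without a posttest score, and differential attrition as the absolute gap between the two conditions' rates. The function names are hypothetical. The text does not report missing counts by condition, so the per-condition values used here are invented and labeled as such.

```python
def overall_attrition(n_assigned, n_missing):
    """Percentage of assigned students without a posttest score."""
    return 100.0 * n_missing / n_assigned


def differential_attrition(n_treat, miss_treat, n_comp, miss_comp):
    """Absolute gap, in percentage points, between the conditions' attrition rates."""
    return abs(overall_attrition(n_treat, miss_treat)
               - overall_attrition(n_comp, miss_comp))


# Reported in the text: 1,629 participants, 327 without a Gates-MacGinitie posttest.
gates_overall = overall_attrition(1629, 327)
print(round(gates_overall, 1))  # 20.1

# Group sizes (845 treatment, 784 comparison) come from the text.
# The per-condition missing counts are NOT reported; 171 and 156 are
# hypothetical values chosen only so that they sum to 327.
gap = differential_attrition(845, 171, 784, 156)
```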


Table 2

Student-Level Demographics by Group

Student descriptives          Comparison n   Comparison %   Treatment n    Treatment %

Gender
  Male                        333            42.5           426            50.4
  Female                      416            53.1           385            45.6
  Missing                     35             4.4            34             4.0
Race
  Caucasian                   490            41.11          482            37.73
  African American            115            9.65           105            8.20
  Hispanic                    467            39.18          523            40.92
  Asian                       14             1.17           30             2.35
  American Indian             98             8.22           129            10.10
  Pacific Islander            8              .67            9              .70
Special education             65             8.3            95             11.2
Home language
  English                     345            44             355            42.0
  Spanish                     363            46.3           407            48.2
  Other                       23             2.9            37             4.4
  Missing                     53             6.8            46             5.4
English language learner      190            24.2           245            29

Intervention Procedures

For approximately 20 weeks, the PACT intervention was delivered
in treatment classes during regularly scheduled social studies
time.  During  the  first  6  to  8  weeks,  three  consecutive  units  were 
taught  in  classes  that  were  either  approximately  45  min  daily 
or  took  place  every  other  day  for  90  min.  For  the  next  12  weeks, 
teachers  implemented  only  one  of  the  PACT  components,  the 
knowledge  acquisition  through  text  reading  routine  described  in 
Figure  1,  three  times  a  week  for  approximately  15  min  per  session. 
Students  in  comparison  sections  received  instruction  on  the  same 
content  over  the  same  amount  of  time  as  students  in  the  PACT 
treatment  classes,  but  delivery  of  the  content  differed,  as  teachers 
took  a  business-as-usual  (BAU)  approach  in  the  comparison 
classes. 

Description  of  the  Treatment  Intervention 

The  PACT  intervention  aligned  with  participating  districts’ 
standards  and  the  Common  Core  Standards  (National  Governors 
Association  Center  for  Best  Practices  &  Council  of  Chief  State 
School  Officers,  2010).  Using  the  set  of  instructional  practices 
from  the  previous  PACT  intervention  studies  (e.g.,  Vaughn  et  al., 
2013),  we  integrated  enhancements  for  ELs  (see  the  earlier  section 
Features of Effective Instructional Practices for ELs). The treatment
features three units, blending five components that work
together  and  complement  one  another.  Following  is  a  description 
of  the  five  PACT  components. 

Comprehension  canopy.  The  comprehension  canopy  starts 
and  guides  every  unit.  It  is  a  10-  to  15-min  routine  to  engage 
students  in  a  purpose  for  reading  while  integrating  new  content  to 
build  students’  background  knowledge.  Teachers  initiate  a  concise 
introduction  and  then  show  a  brief  video  clip  that  provides  students 
with  requisite  background  information  before  encountering  the 
unit  material.  After  students  discuss  video-related  questions  with  a 
partner and share with the class, the teacher introduces an overarching
comprehension question that is revisited and extended
throughout the unit. Each comprehension canopy question is designed
to develop students’ academic language in social studies by
focusing  on  compare  and  contrast,  cause  and  effect,  or  perspective 
taking.  For  example,  students  are  asked,  “How  did  the  colonial 
regions  develop  differently?” 

Essential  words.  Following  the  comprehension  canopy,  five 
key  words  are  introduced  on  the  first  day  of  the  unit.  The  purpose 
of  the  essential  words  component  is  to  teach  the  meaning  of 
concepts  that  are  tightly  connected  with  the  content  and  to  support 
new  learning  by  having  students  engage  with  the  words  in  multiple 
contexts. Each word is taught by using a student-friendly definition,
visual representation, related words, examples and nonexamples,
and a turn-and-talk prompt that asks students to discuss in
pairs an activity related to the essential word(s). As students move
through the unit, they have recurrent exposure to each word in
warm-up activities, reading of varied texts, and team-based activities.
Students are afforded multiple opportunities to apply the
meaning  of  the  essential  words  and  to  use  them  orally  and  in 
writing.  For  example,  during  a  warm-up  activity,  students  are 
required  to  revisit  an  illustration  of  the  essential  word  mercantilism 
and  to  think  about  how  it  makes  some  people  wealthy.  Students 
then  have  to  answer  questions  about  who  benefits  the  least  from 
mercantilism  and  whether  it  is  fair. 


Knowledge  acquisition  through  text  reading.  Three  times  a 
week,  teachers  lead  students  through  a  critical  reading  routine  that 
lasts approximately 15 min and requires students to read informational
text related to the topic. Teachers guide the process by
providing a brief introduction, sharing a video clip or a geographical
map to set up the context for the content to be read. During this
introduction, the teacher also reinforces the essential words that
students will encounter and connects the reading to the comprehension
canopy question. Students read the text as a whole class
with  the  teacher,  in  pairs,  in  small  groups,  or  independently. 
Additionally,  students  address  a  variety  of  content-  and  inference- 
based  questions  verbally  and  in  writing  intermittently  throughout 
the  reading. 

TBL  comprehension  check.  TBL  is  based  on  a  university- 
level  practice  adapted  for  use  in  middle-school  classrooms  to 
provide  opportunities  for  text-based  discussions  and  justifications 
for  ideas  (Michaelsen  &  Sweet,  2011).  Two  key  elements  were 
integrated  into  the  TBL  comprehension  checks:  (a)  heterogeneous 
teams  of  students,  and  (b)  a  process  in  which  students  work 
individually  and  with  a  team  to  ensure  accountability  for  learning 
and  understanding  the  content. 

Twice during each unit, teachers administer a short comprehension
check to examine students’ understanding of unit content and
to inform further instruction. This check has 10 comprehension
questions, with five focusing on vocabulary. First, students individually
complete the comprehension check and turn it in to the
teacher.  The  teacher  monitors  individual  students’  comprehension 
of  content  through  this  initial  comprehension  check.  Next,  students 
complete  the  same  comprehension  check  as  a  two-person  team,  but 
using  their  texts  and  notes  during  this  second  round  to  justify  their 
answers.  Students  use  scratch-off  cards  to  mark  their  answers  and 
receive  immediate  feedback  on  accuracy.  A  correct  answer  reveals 
a  star.  If  the  team  answer  is  incorrect,  the  team  revisits  notes  and 
text,  discusses,  and  selects  an  alternative  answer  supported  with 
text  evidence.  Finally,  the  teacher  provides  whole-class  targeted 
instruction  to  address  gaps  in  students’  understanding. 

TBL  knowledge  application.  Knowledge  application  in 
PACT  requires  teams  to  apply  the  newly  learned  content  of  the 
unit  through  a  problem-solving  activity — for  example,  addressing 
a  question  such  as,  “What  might  have  happened  to  prevent  the 
Revolutionary  War?”  At  the  conclusion  of  every  unit,  teams  of 
four  students  participate  in  a  discussion  that  is  facilitated  by 
sharing  ideas  and  using  text  evidence  to  address  the  task  assigned. 
Students  must  listen  to  team  members’  contributions  and  think 
critically  before  presenting  a  response  to  the  class.  The  teacher 
monitors  progress  while  students  work  in  their  teams,  provides 
feedback  to  teams  as  they  demonstrate  their  understanding  of  the 
content, and facilitates students’ extended thinking about the content
and evidence. At the end of the activity, student teams share
their  responses  and  reasons  with  the  class.  Moreover,  the  teacher 
ends  the  activity  by  connecting  the  knowledge  application  work  to 
the  comprehension  canopy  question  that  started  the  unit.  The 
teacher  synthesizes  key  information  learned  over  the  entire  unit 
and  prepares  the  class  for  an  end  of  unit  assessment. 

Implementation  Support 

The research team provided professional development to participating
teachers during two sessions and provided ongoing in-class
support in treatment classes. Prior to launching the intervention, a
1-day professional development workshop trained teachers to implement
the intervention in treatment classes and stressed the
significance of using the PACT components in treatment classes
exclusively. The training day was devoted to (a) providing an
overview and explaining the design of the study, (b) discussing
relevant research in reading comprehension and content teaching
and learning, (c) explaining and modeling each PACT component,
and (d) allowing hands-on practice with Unit 1 lessons and materials
(The Meadows Center for Preventing Educational Risk,
2013). An additional 3-hr professional development session was
held  for  teachers  after  the  completion  of  the  first  unit.  During  this 
second  session,  research  support  personnel  reviewed  the  elements 
of  the  PACT  components  and  discussed  areas  for  improving  PACT 
implementation. 

Each  teacher  was  assigned  one  research  support  person,  who 
provided  in-class  support  as  needed  throughout  the  units  for  the 
first  6  to  8  weeks.  Research  personnel  filled  various  roles  to  ensure 
high  levels  of  PACT  implementation.  During  implementation  of 
the  first  unit,  research  personnel  were  present  in  treatment  classes 
daily  and  modeled  the  first  occurrence  of  each  PACT  component 
(e.g.,  comprehension  canopy,  essential  words).  Research  personnel 
scaled  back  their  presence  in  treatment  classes  to  two  to  three 
times  a  week  during  the  second  unit,  and  to  one  to  two  times  a 
week  during  the  third  unit.  They  made  more  visits  to  teachers  who 
required  further  coaching  and  feedback.  Personnel  also  answered 
teachers’ questions regarding PACT and assisted them in integrating
the PACT components into their instructional planning for
their treatment classes. During the 12 weeks that teachers continued
to implement the knowledge acquisition through text reading
routine,  research  personnel  visited  with  teachers  three  to  four  times 
to  keep  track  of  the  readings. 

Implementation  Fidelity 

Fidelity  data  were  collected  by  means  of  audio  recordings  in 
all participating teachers’ classes throughout the implementation
of the PACT intervention to measure adherence to the
PACT  components  in  treatment  classes  and  determine  whether 
any  components  were  present  in  comparison  classes.  Before  the 
intervention  started,  a  database  manager  on  the  research  team 
randomly  selected  one  treatment  and  one  BAU  class  period  to 
be  recorded  per  teacher.  Next,  each  teacher  audio  recorded  the 
randomly  selected  class  periods  daily  for  the  duration  of  the 
three  10-day  units  of  intervention  in  the  identified  classes.  Each 
teacher  submitted  approximately  30  treatment  audio  recordings 
from  one  class  period  and  30  BAU  audio  recordings  from 
another  to  be  coded  by  the  research  team.  Research  personnel 
then  coded  two  units  (about  18  recordings)  of  instruction  for 
each  condition  per  teacher.  Figure  1  provides  the  number  of 
opportunities  to  observe  each  PACT  component.  For  example, 
in  two  units,  the  comprehension  canopy  is  observed  two  times 
per teacher; totaling across 18 teachers, this provides 36 possible
opportunities. Other total possible opportunities are as
follows: warm-up = 144; TBL comprehension check = 72;
essential words = 36; knowledge acquisition through text reading = 108;
and TBL knowledge application = 36.
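
The totals above follow from multiplying each component's occurrences per teacher by the 18 teachers. The short sketch below makes that arithmetic explicit. Only the comprehension canopy count (two per teacher across the two coded units) is stated directly in the text. The other per-teacher counts are back-calculated here from the reported totals and should be read as assumptions.

```python
N_TEACHERS = 18

# Occurrences per teacher across the two coded units. Only the canopy
# count (2) is stated directly; the rest are back-calculated from the
# reported totals (total / 18) and are therefore assumptions.
per_teacher = {
    "comprehension canopy": 2,
    "essential words": 2,
    "warm-up": 8,
    "knowledge acquisition through text reading": 6,
    "TBL comprehension check": 4,
    "TBL knowledge application": 2,
}

total_opportunities = {
    name: count * N_TEACHERS for name, count in per_teacher.items()
}

for name, total in total_opportunities.items():
    print(f"{name}: {total}")
```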

The  fidelity  measure  used  to  code  the  audio  recordings  aligned  with 
the PACT components: (a) comprehension canopy, (b) essential
words, (c) warm-up, (d) knowledge acquisition through text reading,
(e)  TBL  comprehension  check,  and  (f)  TBL  knowledge  application. 
Coders  rated  the  extent  to  which  an  individual  teacher  implemented 
required  elements  for  each  component,  using  a  scale  from  1  to  4,  with 
4  indicating  that  the  teacher  completed  all  of  the  expected  aspects  of 
the  component.  If  the  component  was  not  required  or  expected  for  the 
day,  a  not  applicable  (0)  rating  was  assigned. 
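
One way to summarize ratings on this scale is to drop the not applicable (0) codes before computing any proportion, so that days on which a component was not expected do not count against a teacher. The sketch below assumes that ratings of 1 and 2 correspond to "low" and "mid-low"; the text does not define those labels numerically. The function name and the example ratings are hypothetical.

```python
from collections import Counter


def share_low_ratings(ratings, low_values=(1, 2)):
    """Proportion of applicable ratings that fall in low_values.

    Ratings of 0 mean "not applicable" and are excluded from the denominator.
    Treating 1 and 2 as "low" and "mid-low" is an assumption of this sketch.
    """
    applicable = [r for r in ratings if r != 0]
    if not applicable:
        return None  # nothing to summarize
    counts = Counter(applicable)
    return sum(counts[v] for v in low_values) / len(applicable)


# Hypothetical codes for one component across eight recordings (not study data).
example_ratings = [4, 4, 0, 3, 2, 4, 1, 4]
low_share = share_low_ratings(example_ratings)
```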

As  in  prior  PACT  studies,  interrater  reliability  on  the  fidelity 
instrument  was  established  by  using  a  gold-standard  method 
(Gwet,  2012).  A  senior  member  of  the  research  team  held  a  3-hr 
training  session  on  the  use  of  the  fidelity  instrument  for  a  team  of 
six  research  support  personnel.  The  team  examined  the  codebook 
and coding form, reviewed indicators of teacher behaviors associated
with each PACT component, discussed comparison class
coding,  and  practiced  coding  with  videos  and  audio  recordings. 
Two  senior  researchers  then  served  as  the  gold  standard  and  coded 
a  set  of  treatment  and  comparison  audio  recordings.  The  coders 
individually  coded  the  same  audio-recorded  lessons,  using  the 
fidelity  instrument,  and  additional  audio  recordings  were  coded 
until  interrater  agreement  of  90%  or  higher  was  reached.  To  avoid 
observer  drift,  the  coding  team  reestablished  reliability  coding  with 
two  additional  interrater  checkpoints,  using  the  same  interrater 
agreement  of  90%  or  higher.  Although  simple  percent  agreement 
is  popular,  it  can  be  inflated  because  of  chance  (Hintze,  2005). 
Cohen’s kappa (κ) is a more conservative measure of interrater
agreement in that it takes into account chance agreement (Landis
& Koch, 1977; Suen & Ary, 1989). Coefficients can range
from −1.0 to 1.0. The interrater reliability for the raters was found
to be κ = 0.87 (p < .001), 95% confidence interval [0.63, 1.0],
which  is  considered  “substantial”  (Gelfand  &  Hartmann,  1975; 
Landis  &  Koch,  1977). 
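
To show how chance correction lowers the agreement figure, the sketch below computes both simple percent agreement and Cohen's kappa for two coders. It uses the standard unweighted kappa formula. The text does not say whether the study used a weighted variant or how multiple coders were compared with the gold standard, so this is an illustration and not a reconstruction of the study's analysis. The ratings are invented.

```python
from collections import Counter


def percent_agreement(codes_a, codes_b):
    """Share of items on which the two coders gave the same code."""
    matches = sum(a == b for a, b in zip(codes_a, codes_b))
    return matches / len(codes_a)


def cohens_kappa(codes_a, codes_b):
    """Unweighted Cohen's kappa: (observed - chance) / (1 - chance)."""
    if len(codes_a) != len(codes_b) or not codes_a:
        raise ValueError("both coders must rate the same non-empty set of items")
    n = len(codes_a)
    p_observed = percent_agreement(codes_a, codes_b)
    freq_a, freq_b = Counter(codes_a), Counter(codes_b)
    # Chance agreement: product of the coders' marginal proportions per code.
    p_chance = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_chance == 1.0:
        return 1.0  # both coders used one identical category throughout
    return (p_observed - p_chance) / (1.0 - p_chance)


# Hypothetical fidelity codes (1 to 4) from a gold-standard coder and a trainee.
gold = [4, 4, 3, 2, 4, 1, 3, 4, 2, 4]
trainee = [4, 4, 3, 2, 4, 1, 3, 3, 2, 4]

agreement = percent_agreement(gold, trainee)  # 9 of 10 items match
kappa = cohens_kappa(gold, trainee)           # lower than the raw agreement
```

In this made-up example the coders agree on 9 of 10 items, yet kappa is smaller than 0.90 because some of that agreement would be expected by chance given how often each coder used each rating.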

Table 3 presents fidelity data for each of the components in
treatment and comparison classes. In treatment classes, PACT
components were implemented with generally high levels of fidelity.
Teachers struggled the most with implementing the knowledge
acquisition through text reading component, with about 25% of
observations of this component receiving a low or mid-low rating.
Nonetheless, overall, teachers delivered the PACT intervention with
consistently high levels of procedural fidelity.
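The “about 25%” figure can be checked against the treatment-class counts for knowledge acquisition through text reading (KA) in Table 3. The short sketch below is only an arithmetic check; the variable and function names are ours.

```python
# Treatment-class counts for the KA component, as reported in Table 3.
ka_counts = {"high": 7, "mid-high": 74, "mid-low": 21, "low": 6}


def rating_shares(counts):
    """Convert rating counts to percentages of all observations."""
    total = sum(counts.values())
    return {rating: 100 * n / total for rating, n in counts.items()}


shares = rating_shares(ka_counts)
low_or_mid_low = shares["mid-low"] + shares["low"]
print(f"KA observations rated low or mid-low: {low_or_mid_low:.1f}%")
```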

Coders rated audio recordings of BAU instruction in comparison
classes by using the same protocol used for treatment audio
recordings to determine whether there was contamination of the
BAU comparison condition. As displayed in Table 3, differential
instruction for treatment and control students with respect to
PACT was accomplished. Research support personnel frequently
stressed the importance of avoiding PACT spillover into BAU
class sections with participating teachers, and coders identified
limited evidence of PACT in the BAU sections. Elements of the
warm-up, knowledge acquisition, and essential words routines
appeared in some BAU audio recordings, but at low rates compared
with treatment classes. For example, in occurrences of
knowledge acquisition through reading in comparison classes, a
teacher might assign independent reading without some of the
PACT reading routine elements, like providing context for the
reading through video and making connections to essential unit
vocabulary. Similarly, warm-up activities observed in BAU
recordings rarely connected to and reinforced the unit content
like PACT warm-up activities did. However, the PACT intervention
was also made up of less common components, such as
TBL comprehension check, which resembles a quiz. This component
was rarely detected in BAU audio recordings because
some of the elements exclusive to PACT, such as students
working with a team to justify their answers and receiving
immediate feedback from scratch-off forms, were almost never
observed in BAU classes. This trend aligns with findings from
prior PACT studies (Vaughn et al., 2013, 2015), which also
reported intervention components observed to a minimal degree
in comparison classrooms.

[Figure 1 chart: 10-day lesson cycle, minutes of class time by lesson
(Lessons 1-10). Key: Comprehension Canopy; Essential Words; Warm-Up;
Knowledge Acquisition Through Text Reading; Team-Based Learning
Comprehension Check; Team-Based Learning Knowledge Application; End of
Unit Test.]

Figure 1. Frequency and duration of each PACT component during one unit.
The PACT instructional components are embedded into the teachers’ content
instruction. PACT = Promoting Adolescents’ Comprehension of Text;
TBL = Team Based Learning; ELs = English Learners.

Student  Measures 

The  same  measures  of  impact  employed  in  the  prior  two 
PACT  studies  (Vaughn  et  al.,  2013,  2015)  were  used.  Research 
personnel  uninformed  of  the  condition  to  which  students  were 
assigned  administered  all  three  measures  to  students  in  the 
treatment  and  comparison  groups  prior  to  and  immediately 
following  treatment. 

Gates-MacGinitie Reading Comprehension Subtest. The
Gates-MacGinitie Reading Comprehension Subtest (4th ed.;
MacGinitie et al., 2006) is a group-administered, timed (35 min)
assessment of reading comprehension. The assessment consists of
expository and narrative passages ranging in length from three to 15
sentences. Students read each passage silently and answer three to six
multiple-choice questions related to the most recently read passage.
As students progress through the assessment, items increase in
difficulty. Internal consistency reliability ranges from 0.91 to 0.93, and
alternate form reliability is reported as 0.80 to 0.87.

ASK. The researcher-developed Assessment of Social Studies Knowledge
(ASK; Vaughn et al., 2013) is a 42-item, four-option, untimed
multiple-choice test that measures content knowledge in the three units
that composed the intervention (Colonial America, Road to Revolution,
and The American Revolution). Items with known difficulty parameters
were collected with permission from released state and advanced
placement social studies tests from Texas, Massachusetts, and The
College Board. Researcher-developed vocabulary items were also included
in the item set. The ASK was administered at pretest and posttest.

The items for the ASK were selected after a series of pilot tests
to validate the provided difficulty parameters, refine the instructions
for test administrators, and estimate the amount of time
necessary for administration. The final items were selected following
a series of item-level confirmatory factor analyses to evaluate
model fit and estimate item parameters (Vaughn et al., 2013;
Wanzek et al., 2015). The Cronbach’s alpha with the current
sample was .93.

Table 3
Frequency of Fidelity Observations in Treatment and Comparison Classrooms

Treatment classes (number of observations: CC, N = 31; WU, N = 130; TBLC, N = 64; EW, N = 32; KA, N = 108; TBLK, N = 34)

| Fidelity rating  | CC n | CC % | WU n | WU % | TBLC n | TBLC % | EW n | EW % | KA n | KA % | TBLK n | TBLK % |
|------------------|------|------|------|------|--------|--------|------|------|------|------|--------|--------|
| 4 = high         | 15   | 48.4 | 105  | 80.8 | 12     | 18.8   | 17   | 53.1 | 7    | 6.5  | 5      | 14.7   |
| 3 = mid-high     | 11   | 35.5 | 6    | 4.6  | 42     | 65.6   | 11   | 34.4 | 74   | 68.5 | 27     | 79.4   |
| 2 = mid-low      | 4    | 12.9 | 15   | 11.5 | 3      | 4.7    | 3    | 9.4  | 21   | 19.4 | 2      | 5.9    |
| 1 = low          | 1    | 3.2  | 4    | 3.1  | 7      | 10.9   | 1    | 3.1  | 6    | 5.6  |        |        |
| 0 = not observed |      |      |      |      |        |        |      |      |      |      |        |        |

Comparison classes (N = 274 observations for each component)

| Fidelity rating  | CC n | CC % | WU n | WU % | TBLC n | TBLC % | EW n | EW % | KA n | KA % | TBLK n | TBLK % |
|------------------|------|------|------|------|--------|--------|------|------|------|------|--------|--------|
| 4 = high         |      |      | 26   | 9.5  |        |        | 2    | .7   |      |      |        |        |
| 3 = mid-high     |      |      |      |      |        |        | 2    | .7   | 12   | 4.4  |        |        |
| 2 = mid-low      | 8    | 2.9  | 39   | 14.2 | 2      | .7     | 6    | 2.2  | 34   | 12.4 | 3      | 1.1    |
| 1 = low          |      |      |      |      |        |        |      |      | 11   | 4.0  |        |        |
| 0 = not observed | 266  | 97.1 | 209  | 76.3 | 272    | 99.3   | 264  | 96.4 | 217  | 79.2 | 271    | 98.9   |

Note. CC = comprehension canopy; WU = warm-up; TBLC = team-based learning comprehension check; EW = essential words; KA = knowledge
acquisition through text reading; TBLK = team-based learning knowledge application.

MASK and reading comprehension. The MASK (Vaughn et
al., 2015) is a 21-item, four-option, untimed multiple-choice test
that measures reading comprehension in the content area. The
assessment consists of three reading passages drawn from the ASK
(Vaughn et al., 2013, 2015) used in previous PACT studies, but
with altered Lexile levels. For this assessment version, the Lexile
range was 1,090 to 1,140, and the word count range was 312 to
349. Each passage is related to content covered in the three 10-day
cycles. Students read each passage silently and immediately answer
seven multiple-choice questions about the passage. Reading
comprehension items were researcher developed and measured
students’ ability to identify main ideas, understand vocabulary in
context, identify cause and effect, and summarize. The MASK was
administered at pretest and posttest. The alpha coefficient with the
current sample was .92.
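The alpha coefficients reported for the ASK and MASK are internal-consistency estimates. As a minimal sketch of how Cronbach’s alpha is computed from item-level scores, the example below uses a small hypothetical response matrix (five students, four dichotomous items); it is not the study’s data, and the function name is ours.

```python
from statistics import pvariance


def cronbach_alpha(item_scores):
    """Cronbach's alpha for a list of per-student lists of item scores."""
    n_items = len(item_scores[0])
    if n_items < 2:
        raise ValueError("alpha requires at least two items")
    # Variance of each item across students, and variance of the total scores.
    item_variances = [pvariance([row[j] for row in item_scores]) for j in range(n_items)]
    total_variance = pvariance([sum(row) for row in item_scores])
    if total_variance == 0:
        raise ValueError("alpha is undefined when total scores do not vary")
    return (n_items / (n_items - 1)) * (1 - sum(item_variances) / total_variance)


# Hypothetical 0/1 responses: rows are students, columns are items.
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]

alpha = cronbach_alpha(responses)
print(f"alpha = {alpha:.2f}")
```

Population variances are used for both the items and the totals; using sample variances throughout would give the same ratio and therefore the same alpha.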

Results 

We fit a series of three-level regression models to estimate
parameters and evaluate differences. Students were nested in
classes, and classes were nested in teachers. Classes were randomized
to condition within teachers. The following is the reduced
form equation for the model:

\[
\begin{aligned}
\text{Reading Outcome}_{ijk} ={}& \gamma_{000} + \gamma_{010}(\text{Intervention})_{jk} + \gamma_{020}(\text{Class \% of ELs})_{jk} \\
&+ \gamma_{030}(\text{Intervention} \times \text{Class \% of ELs})_{jk} \\
&+ \gamma_{100}(\text{Pretest-gm})_{ijk} + \gamma_{200}(\text{EL})_{ijk} \\
&+ \gamma_{300}(\text{Intervention} \times \text{EL})_{ijk} \\
&+ \gamma_{400}(\text{EL} \times \text{Class \% of ELs})_{ijk} \\
&+ \gamma_{500}(\text{Intervention} \times \text{EL} \times \text{Class \% of ELs})_{ijk} \\
&+ r_{0jk} + u_{00k} + u_{01k}(\text{Intervention})_{jk} + e_{ijk}
\end{aligned}
\]

$\text{Reading Outcome}_{ijk}$ represents the posttest score for student i in
class j taught by teacher k. In similar, previous analyses (Vaughn et
al., 2013; Vaughn et al., 2015) with MASK data and with data
from the related ASK, we included students’ item-level responses
in the measurement part of a structural equation model, estimating
latent outcome scores in a one-parameter item-response model.
Given the complexity of the models in this article, we use raw data
for the ASK and MASK. Sensitivity analyses that use these earlier
data suggest no differences in the direction of findings when using
raw versus latent scores. However, differences in precision across
the two approaches may be present, with the latent scoring model
being more precise. For the Gates-MacGinitie test, we used standard
scores. We dummy-coded Intervention and EL, with the
comparison condition and the non-EL group at 0; treatment condition
and EL are coded as 1. We model Class % of ELs as the
percentage of ELs in each class. Its distribution is relatively
normal, though skewed to the right (M = 28.5, SD = 19.5;
skewness = .90). We centered these data on the 10% threshold.
We did not use the variable’s natural scale because zero is outside
the logical range of the moderator (Bauer & Curran, 2005). We
also did not transform the data because the variable has substantive
meaning in its natural form and because Class % of ELs is the focal
moderator and the basis for interpreting the results. Finally, we
also did not center on the moderator’s mean (M = 28.5) because
the data are somewhat skewed. Instead, we centered Class % of
ELs at about one standard deviation below its mean (Francis &
Vaughn, 2009). We used this value (10% of ELs in the class) to
interpret main effects and interaction effects and to calculate
regions of statistical significance along the continuum of values for
Class % of ELs.
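The two centering choices described above (Class % of ELs centered at 10%, pretest centered at its grand mean) can be sketched as follows. The class percentages and pretest scores are hypothetical, and the function names are ours; only the 10% centering point and the reported M and SD come from the text.

```python
from statistics import mean

CENTER_PCT = 10.0  # centering point for Class % of ELs stated in the text


def center_class_pct(class_pct_el, center=CENTER_PCT):
    """Shift Class % of ELs so that zero corresponds to a class with 10% ELs."""
    return class_pct_el - center


def grand_mean_center(values):
    """Subtract the overall (grand) mean from every score."""
    grand_mean = mean(values)
    return [v - grand_mean for v in values]


# Hypothetical class percentages and pretest scores, for illustration only.
class_pcts = [10.0, 28.5, 48.0]
centered_pcts = [center_class_pct(p) for p in class_pcts]

pretests = [12, 17, 22]
centered_pretests = grand_mean_center(pretests)

# One standard deviation below the reported mean (28.5 - 19.5) is 9.0,
# which the text rounds to the 10% centering point.
one_sd_below_mean = 28.5 - 19.5

print(centered_pcts, centered_pretests, one_sd_below_mean)
```

After centering, a coefficient for any term that interacts with Class % of ELs is read as the effect in a class with 10% ELs, which is how the main effects are interpreted in the Results.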

Pretest scores on the ASK, MASK, and Gates-MacGinitie are
included as grand-mean centered covariates. Intervention and
Class % of ELs are modeled at Level 2 (class level), as is the
two-way interaction involving Intervention and Class % of ELs.
Other two-way interactions and the three-way interaction (all
involving EL) are modeled at Level 1 (student level). Class-level
residuals are modeled as $r_{0jk}$, and student-level residuals are
modeled as $e_{ijk}$. Classes are randomized to condition within teachers,
and $u_{00k} + u_{01k}(\text{Intervention})_{jk}$ represents class-level residuals
within teachers for the treatment and comparison groups. Effect
sizes for Intervention and EL status are estimated as the ratio of the
model-derived coefficient for Intervention (or EL status) and the
pooled within-group standard deviation across conditions (or EL
status) at posttest.
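The effect-size definition above is a ratio of a model coefficient to a pooled posttest standard deviation. The sketch below illustrates that ratio for the ASK Intervention coefficient. The text does not state exactly which groups were pooled, so pooling the non-EL comparison and treatment posttest SDs from Table 4 is our assumption; the result is close to, but not a reproduction of, the reported ES of .40.

```python
from math import sqrt


def pooled_sd(sd_a, n_a, sd_b, n_b):
    """Pooled within-group standard deviation for two groups."""
    return sqrt(((n_a - 1) * sd_a**2 + (n_b - 1) * sd_b**2) / (n_a + n_b - 2))


def effect_size(coefficient, sd_pooled):
    """Model coefficient expressed in pooled posttest SD units."""
    return coefficient / sd_pooled


# ASK posttest SDs and ns for non-ELs (Table 4): comparison, then treatment.
# Which groups to pool is an assumption made for illustration.
sd = pooled_sd(8.27, 474, 8.98, 477)

# ASK Intervention coefficient from Table 5.
es = effect_size(3.58, sd)
print(f"pooled SD = {sd:.2f}, effect size = {es:.2f}")
```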

Table 4 summarizes pretest and posttest means, standard deviations,
and observed score ranges for the Reading Outcomes by
Intervention and by EL status. Table 5 summarizes model parameters
for each of the three reading outcomes. The reader should use
values in Table 5 to interpret main effects and interaction effects,
rather than Table 4. The value for any given parameter in the
model is conditional on the other parameters in the model. For
example, the main effect for treatment in Table 5 is the effect for
non-ELs in classes with 10% ELs (Class % of ELs). Predicted treatment
effects for ELs and non-ELs at other values of Class % of ELs can
be calculated and plotted (as described later), but they are based on
the equation described in the earlier paragraph and have to be
interpreted in terms of the codes for EL status (i.e., 0/1) and for
treatment condition.

The intercept ($\gamma_{000}$) for each model is the mean posttest score
when all predictors are at zero or at their centered value if other
than zero. Coefficients for the main effects of Intervention ($\gamma_{010}$),
EL status ($\gamma_{200}$), and Class % of ELs ($\gamma_{020}$) in a model with a
three-way interaction represent deviations from this intercept value
(Hoffman, 2014). Our interest is in the main effect of treatment on
learning and reading outcomes and the moderating influence of EL
status and Class % of ELs. We interpret the results accordingly.

For the ASK, the estimated coefficient for Intervention ($\gamma_{010}$ =
3.58, p < .01, Effect Size (ES) = .40) represents the amount by
which the intercept ($\gamma_{000}$ = 21.74) increases for non-ELs in
treatment classes with 10% ELs. The predicted scores for ELs and
non-ELs differ as well. The effect of EL status ($\gamma_{200}$ = -2.65, p <
.001, ES = -.31) means that ELs in comparison classrooms with
10% ELs scored 2.65 points lower than non-ELs in the same
comparison classroom. Among the two-way interactions involving
Intervention, the coefficient for the Intervention × EL term is
positive and differs statistically from zero ($\gamma_{300}$ = 2.35, p < .05),
meaning that the difference in knowledge acquisition between ELs
and non-ELs is significantly smaller (less negative, in this case) in
treatment classes with 10% ELs than in comparison classes with
10% ELs (again, in the context of a significant three-way interaction).
The significant three-way interaction ($\gamma_{500}$ = -.08, p < .01)
means that the regression coefficient for the interaction of
Intervention and EL, $\gamma_{300}$ = 2.35, is conditional on values of Class %
of ELs. The coefficient or slope for the two-way interaction
changes by -.08 units for every one-point change in Class % of ELs.
Another way of saying this is that the EL/non-EL difference in treatment
classes widens as EL becomes more prevalent in a class.
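The interpretation above can be made concrete by plugging the ASK fixed-effect estimates from Table 5 into the model equation. The sketch below computes predicted posttest scores for a student at the grand-mean pretest (so the pretest term drops out) with all random effects set to zero; these simplifications, the function names, and the 30% illustration value are ours.

```python
# ASK fixed-effect estimates copied from Table 5.
ASK = {
    "intercept": 21.74,
    "intervention": 3.58,
    "el": -2.65,
    "class_pct": -0.04,
    "int_x_el": 2.35,
    "int_x_pct": -0.01,
    "el_x_pct": 0.02,
    "int_x_el_x_pct": -0.08,
}


def predicted_ask(intervention, el, class_pct_el, b=ASK, center=10.0):
    """Predicted ASK posttest at the grand-mean pretest, random effects at zero."""
    c = class_pct_el - center  # Class % of ELs is centered at 10%
    return (
        b["intercept"]
        + b["intervention"] * intervention
        + b["el"] * el
        + b["class_pct"] * c
        + b["int_x_el"] * intervention * el
        + b["int_x_pct"] * intervention * c
        + b["el_x_pct"] * el * c
        + b["int_x_el_x_pct"] * intervention * el * c
    )


def el_gap(intervention, class_pct_el):
    """Predicted EL minus non-EL difference within one condition."""
    return predicted_ask(intervention, 1, class_pct_el) - predicted_ask(
        intervention, 0, class_pct_el
    )


for pct in (10.0, 30.0):
    print(pct, round(el_gap(0, pct), 2), round(el_gap(1, pct), 2))
```

At 10% ELs the predicted EL/non-EL gap is -2.65 points in comparison classes and -0.30 points in treatment classes; at 30% ELs the treatment-class gap grows to -1.50 points, which is the widening described in the text.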

Table 4
Pretest and Posttest Means, Standard Deviations, and Ranges for Reading Outcomes by English Learner Status and
Treatment Condition

| Measure and group  | Pretest M | Pretest SD | Pretest n | Pretest range | Posttest M | Posttest SD | Posttest n | Posttest range |
|--------------------|-----------|------------|-----------|---------------|------------|-------------|------------|----------------|
| ASK                |           |            |           |               |            |             |            |                |
| Non-EL, Comparison | 17.12     | 6.87       | 504       | 1-38          | 22.14      | 8.27        | 474        | 1-41           |
| Non-EL, Treatment  | 17.67     | 7.56       | 507       | 0-37          | 25.97      | 8.98        | 477        | 6-42           |
| EL, Comparison     | 14.13     | 5.43       | 173       | 4-37          | 16.59      | 7.07        | 163        | 4-37           |
| EL, Treatment      | 14.76     | 6.15       | 232       | 2-37          | 19.93      | 8.86        | 223        | 4-39           |
| MASK               |           |            |           |               |            |             |            |                |
| Non-EL, Comparison | 8.40      | 4.43       | 492       | 1-20          | 10.41      | 5.07        | 446        | 0-20           |
| Non-EL, Treatment  | 8.08      | 4.77       | 511       | 0-20          | 10.66      | 5.17        | 449        | 1-20           |
| EL, Comparison     | 5.66      | 3.34       | 161       | 1-17          | 8.09       | 4.09        | 144        | 1-19           |
| EL, Treatment      | 6.28      | 4.09       | 235       | 0-20          | 8.20       | 4.67        | 204        | 0-20           |
| Gates-MacGinitie   |           |            |           |               |            |             |            |                |
| Non-EL, Comparison | 95.08     | 14.41      | 495       | 65-135        | 96.48      | 13.69       | 461        | 65-135         |
| Non-EL, Treatment  | 96.22     | 15.58      | 508       | 65-135        | 97.66      | 15.63       | 465        | 65-135         |
| EL, Comparison     | 83.47     | 13.36      | 173       | 65-131        | 87.98      | 12.92       | 150        | 65-131         |
| EL, Treatment      | 87.24     | 14.31      | 236       | 65-135        | 88.70      | 14.02       | 199        | 65-131         |

Note. ASK = Assessment of Social Studies Knowledge; EL = English language learner; MASK = Modified Assessment of Social Studies Knowledge
and Reading Comprehension; Gates-MacGinitie = Gates-MacGinitie Reading Comprehension Subtest.

Table 5
Multilevel Model Results for Outcomes

| Measure | Predictor | b | SE | ES |
|---------|-----------|---|----|----|
| Assessment of Social Studies Knowledge | Intercept | 21.74*** | .59 | |
| | Pretest | [illegible]*** | .03 | |
| | Intervention | 3.58** | 1.27 | .40 |
| | EL | -2.65*** | .90 | -.31 |
| | Class % of ELs | -.04 | .03 | |
| | Intervention × EL | 2.35* | 1.04 | |
| | Intervention × Class % of ELs | -.01 | .05 | |
| | EL × Class % of ELs | .02 | .03 | |
| | Intervention × EL × Class % of ELs | -.08** | .03 | |
| Modified Assessment of Social Studies Knowledge and Reading Comprehension | Intercept | 10.34*** | .34 | |
| | Pretest | .62*** | .04 | |
| | Intervention | 1.00* | .41 | .20 |
| | EL | -1.26* | .54 | -.26 |
| | Class % of ELs | -.06*** | .02 | |
| | Intervention × EL | .35 | .91 | |
| | Intervention × Class % of ELs | -.02 | .02 | |
| | EL × Class % of ELs | .06*** | .02 | |
| | Intervention × EL × Class % of ELs | -.04 | .03 | |
| Gates-MacGinitie Reading Comprehension Subtest | Intercept | 96.17*** | 1.02 | |
| | Pretest | .60*** | .03 | |
| | Intervention | 1.71 | 1.48 | .12 |
| | EL | -1.21 | 1.81 | -.08 |
| | Class % of ELs | -.10** | .04 | |
| | Intervention × EL | -1.21 | 1.81 | |
| | Intervention × Class % of ELs | -.08 | .06 | |
| | EL × Class % of ELs | .03 | .06 | |
| | Intervention × EL × Class % of ELs | .04 | .06 | |

Note. SE = Standard Error; EL = English Learner; ES = Effect Size.
* p < .05. ** p < .01. *** p < .001.

To probe the three-way interaction, we plotted predicted scores
for the four groups defined by the interaction of Intervention and
EL status across values of Class % of ELs (see Figure 2). The
reader can find the corresponding y value (adjusted posttest score
on the ASK) for a given value of x (Class % of ELs) for any or all
of the four groups (i.e., EL in treatment, EL in comparison, non-EL
in treatment, and non-EL in comparison). To evaluate the moderating
effect of Class % of ELs on variation in the Intervention ×
EL status interaction, we calculated regions of significance along
Class % of ELs based on fixed-effects estimates and their associated
covariance matrices, as described by Bauer and Curran
(2005). The regions of significance along Class % of ELs are
bounded by 8.8 and 11.48. This means that the difference between
ELs and non-ELs in treatment classes is significantly smaller than
the difference between ELs and non-ELs in comparison classes for
values of Class % of ELs below 8.8%. Between 8.8% and 11.48%,
the difference for ELs and non-ELs in comparison and treatment
classes does not differ from zero. Above 11.48%, the difference
between ELs and non-ELs in comparison classes is significantly
smaller than the difference between ELs and non-ELs in the
treatment classes, and the difference is increasingly smaller as
Class % of ELs increases.
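Regions of significance of the kind described above are found by asking where the conditional interaction effect, $\gamma_{300} + \gamma_{500} z$ (with z the centered Class % of ELs), has a test statistic equal to the critical value. The sketch below shows that calculation in general form. The article reports standard errors but not the covariance between the two estimates, so the covariance of 0.0 used here is a placeholder assumption, as is the normal-approximation critical value of 1.96; the output therefore illustrates the method and should not be expected to match the reported bounds.

```python
from math import sqrt


def regions_of_significance(b_focal, b_mod, var_focal, var_mod, cov, crit=1.96):
    """Moderator values where |(b_focal + b_mod*z) / SE(z)| equals crit.

    Solves (b_focal + b_mod*z)^2 = crit^2 * (var_focal + 2*z*cov + z^2*var_mod).
    Returns the two bounds in ascending order, or None if no real bounds exist.
    """
    a = b_mod**2 - crit**2 * var_mod
    b = 2 * (b_focal * b_mod - crit**2 * cov)
    c = b_focal**2 - crit**2 * var_focal
    if a == 0:
        return None  # degenerate case: the boundary equation is not quadratic
    discriminant = b * b - 4 * a * c
    if discriminant < 0:
        return None
    root = sqrt(discriminant)
    return tuple(sorted(((-b - root) / (2 * a), (-b + root) / (2 * a))))


# ASK estimates and standard errors from Table 5.
B_INT_X_EL, SE_INT_X_EL = 2.35, 1.04
B_THREE_WAY, SE_THREE_WAY = -0.08, 0.03
ASSUMED_COV = 0.0  # hypothetical: the covariance is not reported in the article

bounds = regions_of_significance(
    B_INT_X_EL, B_THREE_WAY, SE_INT_X_EL**2, SE_THREE_WAY**2, ASSUMED_COV
)
print(bounds)  # bounds are on the centered scale (add 10 for the raw percentage)
```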

For the MASK, the coefficient for the three-way interaction
term is not statistically significant ($\gamma_{500}$ = -.04); however, we
report the effects for the full model (i.e., the model with the
three-way interaction term) so that parameter estimates can be
compared across the three reading outcomes, and we interpret only
the main effects. The main treatment effect on the MASK differs
statistically from zero ($\gamma_{010}$ = 1.00, p < .05, ES = .20), meaning
that non-ELs in treatment classrooms with 10% ELs scored about
1 point higher on the posttest than non-ELs in comparison classrooms
with 10% ELs. The significant effect for EL status
($\gamma_{200}$ = -1.26, p < .05, ES = -.26) suggests that ELs performed
worse than non-ELs in comparison classes with 10% ELs, and the
significant effect for Class % of ELs ($\gamma_{020}$ = -.06, p < .001)
indicates that MASK scores for non-ELs in comparison classes
decrease by .06 points for each additional percentage point on
Class % of ELs. On the Gates-MacGinitie test, only the main
effect of Class % of ELs ($\gamma_{020}$ = -.10, p < .01) is significant and
negative, similar to the trend on the MASK.

Discussion 

This study investigated the efficacy of the PACT set of instructional
practices (Vaughn et al., 2013) adjusted to meet the needs of
ELs in eighth-grade social studies classes. We hypothesized that
students who were not ELs would perform similarly to students in
previous studies (Vaughn et al., 2013, 2015). We also hypothesized
that ELs who received the PACT treatment would outperform
ELs in the BAU comparison condition. In sum, we believed
that PACT enhanced with instructional practices for ELs would
positively affect all learners participating in the treatment condition.

These hypotheses were confirmed. Both ELs and non-ELs who
received the treatment outperformed those assigned to the BAU
comparison condition on measures of content knowledge acquisition
(ES = 0.40) and content-related reading comprehension
(ES = 0.20). We interpret these findings as particularly impactful
because there is complete overlap in the content taught to the
treatment condition and the BAU comparison condition, with the
only variation being the manner in which the content was taught.
Furthermore, because randomization was at the class level, teacher
effects were controlled, allowing for students in the treatment and
comparison conditions to have the same teacher. This design
provides a challenging test to the treatment, increasing confidence
in the effect for content knowledge acquisition. In addition, these
findings align with those of prior studies of PACT’s efficacy with
general populations of students, which reported effect sizes of 0.17
(Vaughn et al., 2013) and 0.32 (Vaughn et al., 2015) on content
knowledge acquisition, and an effect size of 0.29 (Vaughn et al.,
2013) on content-related reading comprehension. We administered
a standardized reading comprehension measure to determine
whether there were differential effects on reading comprehension
for students in the treatment or control condition. We did not
hypothesize differences, based on findings from previous PACT
studies; however, we also wanted to ensure that participating in the
PACT treatment did not negatively affect target students.

Figure 2. Predicted Assessment of Social Studies Knowledge scores for the four groups as a function of class
percentage (%) of ELs. Class % of ELs = percentage of students in each class who are ELs; C_non-ELs =
non-ELs randomized to the comparison group; C_ELs = ELs randomized to the comparison condition;
T_non-ELs = non-ELs assigned to treatment; T_ELs = ELs randomized to treatment.

We further hypothesized that the benefit of PACT would vary,
depending on the class-level prevalence of English academic language,
which we defined as the percent of ELs in the classroom.
Specifically, we assumed that increases in the class-level percentage
of ELs would disadvantage both ELs and non-ELs because
sophisticated content-related English academic language is less
available to all students under such conditions. Our rationale was
that overreliance on discourse-based practices among peers whose
language and vocabulary use in English were still developing would
reduce the overall effects of the treatment. Our results suggest that the
content knowledge acquisition outcomes for ELs and non-ELs are
more similar in treatment classes with 10% ELs than in BAU classes
with 10% ELs. In other words, at the >0% to 10% EL threshold, ELs
and non-ELs respond comparably to the PACT intervention on a
measure of content knowledge acquisition.

As predicted, this relative advantage for ELs in PACT diminishes
as classrooms become increasingly diverse (increased levels
of Class % of ELs). Trends for content knowledge acquisition
across values of Class % of ELs decline for all groups. However,
the outcomes for ELs and non-ELs in treatment classrooms become
significantly less similar than the relative outcomes for ELs
and non-ELs in BAU as classes become increasingly language
diverse. In other words, as the percentage of ELs in a class
increases, performance on the content knowledge acquisition measure
decreases for all students but more dramatically for ELs than
for non-ELs. This moderating effect for Class % of ELs begins at
about 12% ELs and continues across the range of Class % of ELs.
This finding suggests that the influence of PACT may depend
partly on the quality of classroom discourse and that ELs are
increasingly more disadvantaged than non-ELs in PACT-like
interventions when English academic language is less accessible or
less often used by one’s classmates.

The  question  then  arises  about  how  to  interpret  this  finding  in 
practice.  One  interpretation  is  that  discourse-based  treatments  have 
a  stronger  impact  on  knowledge  acquisition  for  all  students  when 
less  than  12%  of  the  class  are  ELs.  Conversely,  this  finding  may 
mean  that  for  a  discourse-based  treatment  to  sustain  its  impact  as 
the  proportion  of  ELs  in  the  class  increases,  additional  supports  are 
necessary.  We  hesitate  to  conjecture  too  much  about  the  practical 
implications  from  this  study  until  further  studies  confirm  this 
finding. 

Although  students  in  the  treatment  classrooms  outperformed 
students  in  the  BAU  comparison  classrooms  on  the  measure  of 
content  knowledge  acquisition,  it  should  be  noted  that  the  increase 
in  ELs  in  the  classroom  in  more  traditional  instruction  (i.e.,  BAU 
comparison  condition)  in  social  studies  did  not  negatively  affect 
the  performance  of  ELs.  One  possible  explanation  for  this  finding 
is  that  in  traditionally  instructed  social  studies  classes  (i.e.,  BAU), 
ELs  spend  little  time  interacting  with  text  and  with  each  other  to 
establish meaning (Swanson et al., 2015).

Examining the moderating effect of proportion of ELs in the class
on content comprehension and general comprehension yielded different
findings than those for content knowledge acquisition. The
absence of a significant two- or three-way interaction with Intervention
suggests that EL status, Class % of ELs, and the interaction of EL and
Class % of ELs do not influence the treatment’s effect on students’
content reading comprehension (MASK) or general reading
comprehension (Gates-MacGinitie) outcomes. This finding may arise because
peer discourse and language use occurred largely in the
PACT components related to content knowledge (e.g., TBL
comprehension check, TBL knowledge application), rather than reading
comprehension (e.g., knowledge acquisition through text reading).
Several limitations to the current study should be noted. First,
this study was conducted in three school districts in the southeast
and southwest United States, limiting generalization to students
and teachers in these regions. In addition, a limitation to
most studies investigating ELs in secondary settings, including this
one, is the lack of availability of participants’ language proficiency
in their first and second languages. A standardized measure of
English proficiency would be necessary to address whether the
saturation of ELs in the classroom is a less significant factor
when students’ proficiency in English is higher. We are unable to
address this important question because we had only district-identified
classification information on ELs and were not able to
obtain a measure of English proficiency.
Expanding the Developmental Models of Writing: A Direct and Indirect
Effects Model of Developmental Writing (DIEW)

Young-Suk  Grace  Kim  and  Christopher  Schatschneider 

Florida  State  University  and  Florida  Center  for  Reading  Research 


We  investigated  direct  and  indirect  effects  of  component  skills  on  writing  (DIEW)  using  data  from  193 
children  in  Grade  1.  In  this  model,  working  memory  was  hypothesized  to  be  a  foundational  cognitive 
ability  for  language  and  cognitive  skills  as  well  as  transcription  skills,  which,  in  turn,  contribute  to 
writing.  Foundational  oral  language  skills  (vocabulary  and  grammatical  knowledge)  and  higher-order 
cognitive  skills  (inference  and  theory  of  mind)  were  hypothesized  to  be  component  skills  of  text 
generation  (i.e.,  discourse-level  oral  language).  Results  from  structural  equation  modeling  largely 
supported  a  complete  mediation  model  among  4  variations  of  the  DIEW  model.  Discourse-level  oral 
language,  spelling,  and  handwriting  fluency  completely  mediated  the  relations  of  higher-order  cognitive 
skills,  foundational  oral  language,  and  working  memory  to  writing.  Moreover,  language  and  cognitive 
skills  had  both  direct  and  indirect  relations  to  discourse-level  oral  language.  Total  effects,  including  direct 
and  indirect  effects,  were  substantial  for  discourse-level  oral  language  (.46),  working  memory  (.43),  and 
spelling  (.37);  followed  by  vocabulary  (.19),  handwriting  (.17),  theory  of  mind  (.12),  inference  (.10),  and 
grammatical  knowledge  (.10).  The  model  explained  approximately  67%  of  variance  in  writing  quality. 
These  results  indicate  that  multiple  language  and  cognitive  skills  make  direct  and  indirect  contributions, 
and  it  is  important  to  consider  both  direct  and  indirect  pathways  of  influences  when  considering  skills  that 
are  important  to  writing. 

Keywords:  developmental  model  of  writing,  cognitive  skills,  oral  language  skills,  direct  effect,  indirect 
effect 


Writing  is  one  of  the  most  complex  tasks  (Olive,  2004),  drawing 
on  a  large  number  of  language  and  cognitive  skills.  Two  prominent 
models  of  developmental  writing  with  empirical  support  include 
the  simple  view  of  writing  and  not-so-simple  view  of  writing. 
According  to  the  simple  view  of  writing,  writing  is  a  product  of 
two  necessary  skills,  transcription  and  ideation  (also  called  text 
generation;  Berninger,  Abbott,  Abbott,  Graham,  &  Richards,  2002; 
Juel,  Griffith,  &  Gough,  1986).  The  not-so-simple  view  of  writing 
expanded  the  simple  view  of  writing  in  two  ways.  First,  executive 
function  and  self-regulatory  processes  (e.g.,  attention,  goal  setting, 
reviewing) were included, in addition to text generation and transcription
skills (Berninger & Amtmann, 2003; Berninger & Winn,
2006). Second, working memory was hypothesized to be at the
center of these three components (text generation, transcription,
and self-regulation), needed for accessing long-term memory during
the planning and composing processes and short-term memory during
the review process (Berninger & Winn, 2006).
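
Read literally, the "product" in the simple view can be written as a multiplicative relation. The symbols below are illustrative assumptions rather than the authors' notation: $W$ stands for writing, $T$ for transcription, and $I$ for ideation (text generation).

```latex
W = T \times I
```

Under this reading, $W$ is low whenever either $T$ or $I$ is low, which is one way to interpret the claim that both skills are necessary.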

Although highly informative, these two models lacked specificity
about component skills, particularly for text generation, and about
relations among component skills. In the present study, our goal
was to expand the developmental models of writing by investigating
component skills of text generation and their relations to
writing  quality.  To  this  end,  we  used  data  from  beginning  writers 
to  test  a  direct  and  mediated  model  of  text  generation  (i.e., 
discourse-level  language),  and  four  different  variations  of  the 
direct  and  indirect  effects  models  of  writing  (DIEW). 

Developmental  Models  of  Writing  and  Component 
Skills  of  Writing 

As  writing  requires  written  texts,  transcription — the  process  and 
physical  acts  of  representing  sounds  to  written  symbols,  including 
spelling  and  handwriting  skills  (McCutchen,  2000) — is  necessary. 
Lack of accuracy and fluency in transcription skills constrains
writing by interfering with higher-order skills such as planning and
content generation (Bourdin & Fayol, 1994; Graham, Berninger,
Abbott, Abbott, & Whitaker, 1997; McCutchen, 2000). Much
evidence has supported the importance of transcription skills in
writing (Abbott & Berninger, 1993; Berninger et al., 2002;
Berninger et al., 1997; Graham et al., 1997; Graham & Harris,
2000; Kim et al., 2011; Kim, Al Otaiba, Wanzek, & Gatlin, 2015;
Kim, Al Otaiba, Sidler, Greulich, & Puranik, 2013; Kim, Park, &
Park, 2015; Limpo & Alves, 2013; McCutchen, 1996).

Ideation or text generation includes generation and organization
of ideas (Juel et al., 1986). Text generation necessarily involves
oral language representation (Berninger et al., 2002; Kim et al.,
2011; McCutchen, 2006) because generated preverbal ideas and
thoughts have to be encoded into oral language before being
transcribed into written texts. Therefore, text generation is
operationalized as oral language skills. Accumulating evidence has
indeed indicated the relation of oral language skills to writing (e.g.,
Shanahan & Lomax, 1986). Individual differences in vocabulary
(Coker, 2006) and grammatical knowledge (Olinghouse, 2008)
were related to writing for children in primary grades. Similarly,
oral language composed of vocabulary and grammatical knowledge
was independently related to writing for primary-grade children
after accounting for transcription skills (Kim et al., 2011;
Kim, Al Otaiba, Folsom, Greulich, & Puranik, 2014). Furthermore,
discourse-level oral language was related to writing after accounting
for spelling (Juel et al., 1986; Kim, Al Otaiba et al., 2015) and
sentence and reading comprehension (Berninger & Abbott, 2010).

Both  the  simple  view  and  not-so-simple  view  of  writing  have 
been  highly  useful  as  a  framework  for  understanding  development 
of  writing  skills.  However,  some  critical  aspects  of  these  models 
are  underspecified,  particularly  with  regard  to  interrelations  among 
component  skills  and  pathways  of  influences  of  component  skills 
on  writing.  This  underspecification  is  most  prominent  with  text 
generation. Although text generation has been described as a
complex (Juel et al., 1986) and dynamic process where ideas are
produced and represented as language in memory at the word,
sentence, and discourse level (Berninger et al., 2002), no further
details are elaborated with regard to skills that contribute to text
generation (or oral language generation). This contrasts sharply with
the greater specification of skills involved in transcription
processes, including phonological processing, orthographic knowledge
(e.g., print experience, phoneme-grapheme correspondences),
and morphological skills (Berninger et al., 2002; Juel et al., 1986).
In fact, when Juel, Griffith, and Gough (1986) examined the
simple view of writing, they included component skills of spelling
(phonological awareness and exposure to print) and pathways of
their influences. They found that phonological awareness and
exposure to print were directly related to a phonological decoding
skill, which directly influenced children’s spelling, which, in turn,
was directly related to writing. These results suggest that there are
multiple component skills necessary for a transcription skill, spelling,
and that some have direct relations, whereas others have indirect
relations, to spelling. Critically missing in Juel et al.’s (1986) study,
however, were component skills of text generation, which was
operationalized as discourse-level oral language production. An
understanding of component skills of discourse-level oral language
is critical to the expansion of our knowledge about skills
involved in writing development, and has important implications
for instruction and assessment. Specifically, a precise understanding
of component skills of discourse-level oral language would
inform what skills need to be assessed and targeted in instruction
in order to improve discourse-level oral language as well as
writing.

Component  Skills  of  Discourse-Level  Oral  Language 

Discourse-level oral language refers to comprehension and production
of multiple utterances or extended text such as conversations,
and narrative and informational oral texts (Kim & Pilcher, in
press). Growing evidence indicates that discourse-level oral language
is a higher-order skill that draws on a multitude of language
and cognitive skills, including foundational oral language skills
(vocabulary and grammatical knowledge; Florit, Roch, & Levorato,
2011, 2014; Kim, 2015, 2016; Senechal, Ouellette, & Rodney,
2006; Tunmer, 1989), foundational cognitive skills (working
memory,  inhibitory  control,  attention;  Daneman  &  Merikle,  1996; 
Florit,  Roch,  Altoe,  &  Levorato,  2009;  Kim,  2015,  2016;  Kim  & 
Phillips,  2014),  and  higher-order  cognitive  skills  (e.g.,  inference 
and  theory  of  mind;  Kendeou,  Bohn-Gettler,  White,  &  van  den 
Broek,  2008;  Kim,  2015,  2016;  Kim  &  Phillips,  2014;  Lepola, 
Lynch,  Laakkonen,  Silven,  &  Niemi,  2012;  Strasser  &  del  Rio, 
2014;  Tompkins,  Guo,  &  Justice,  2013). 

According  to  theoretical  models  of  discourse  comprehension 
and  production,  there  are  three  levels  of  mental  representations:  the 
situation  model,  textbase,  and  surface  code  (e.g.,  Fletcher  & 
Chrysler,  1990;  Graesser,  Singer,  &  Trabasso,  1994;  Kintsch, 
1988;  van  Dijk  &  Kintsch,  1983).  The  situation  model  is  the 
interlocutor’s  representation  of  the  events,  actions,  and  characters 
(what  the  text  is  about;  van  Dijk  &  Kintsch,  1983),  and  is  the 
highest level of mental representation. The situation model is built
on textbase representation (propositional representation, that is, what
is expressed in the text), which then requires surface code representation
(linguistic input of the text such as words and phrases, that is, how
something is expressed in the text). The situation model is more
than  an  assembly  of  propositions,  and  requires  linking  propositions 
across  the  text  and  to  general  background  knowledge  in  order  to 
integrate  and  infer  meanings  and  establish  a  coherent  whole 
(Graesser  et  al.,  1994;  Kintsch,  1988;  van  Dijk  &  Kintsch,  1983; 
van  den  Broek,  Risden,  Fletcher,  &  Thurlow,  1996). 

Recently, Kim (2016) proposed and tested the direct and mediated
model of discourse-level language, in which different language and
cognitive skills are mapped onto the three levels of mental representations
and are hypothesized to be directly and indirectly related to
discourse-level oral language (see Figure 1 for a conceptual model).
For the process of establishing global coherence (i.e., situation
model), higher-order cognitive skills such as inference and perspective
taking (as measured by theory of mind tasks) are important (Kim,
2015, 2016; Kim & Phillips, 2014). Furthermore, vocabulary, grammatical
knowledge, working memory, and attentional control are
necessary for constructing initial propositions (i.e., textbase
representation; Kim, 2015, 2016). Note that in this conceptual model, although
all the foundational language and cognitive skills are necessary for
surface code representation, working memory and attentional control
are hypothesized to be foundational cognitive skills necessary for any
learning tasks, including vocabulary and grammatical knowledge.
The direct and mediated models of discourse-level language fit the data
very well for discourse comprehension for elementary grade children
(Kim, 2015, 2016), such that discourse-level language comprehension
(i.e., listening comprehension) was directly predicted by higher-order
cognitive skills (e.g., inference, perspective taking, and comprehension
monitoring), which, in turn, were directly predicted by foundational
oral language (vocabulary and grammatical knowledge) and
cognitive skills (working memory; Kim, 2015, 2016). Furthermore,
working memory was also directly related to vocabulary and grammatical
knowledge, as well as discourse-level oral language over and
above foundational oral language and higher-order cognitive skills
(Kim, 2016).


DEVELOPMENTAL  MODEL  OF  WRITING 




[Figure 1 diagram labels: three levels of mental representation (situation model, textbase, surface code)
underlie discourse comprehension and production. Higher order cognitive skills (e.g., inference, perspective
taking, comprehension monitoring) and foundational language and cognitive skills (e.g., vocabulary,
grammatical knowledge, working memory, attentional control) are shown alongside these levels.]

Figure 1. Language and cognitive skills associated with three levels of text representations (modified from
Kim, 2016, reprinted with permission).


Present  Study 

Building  on  this  growing  evidence  about  discourse-level  oral 
language,  and  previous  studies  about  component  skills  of  writing 
(e.g.,  transcription  skills,  working  memory,  and  language  skills), 
the  primary  goal  in  the  present  study  was  to  unpack  the  nature  of 
relations between various language and cognitive skills and writing
for beginning writers. To achieve this goal, we first examined
the direct and mediated relations of component skills of
discourse-level language production. If discourse-level oral language is an
upper-level skill that draws on several language and cognitive
component skills, then an important corollary is how all these
component skills, including discourse-level oral language, language
and cognitive component skills (e.g., working memory,
vocabulary, grammatical knowledge, inference, and perspective
taking), and transcription skills, fit into the developmental models
of writing. For instance, vocabulary and grammatical knowledge
were shown to be related to writing (Kim et al., 2011, 2014;
Olinghouse,  2008).  If  this  is  the  case,  would  they  then  be  directly 
related  to  writing  over  and  above  discourse-level  oral  language,  or 
would  their  relations  be  primarily  mediated  by  discourse-level  oral 
language? 

Additionally, how are higher-order cognitive skills such as
inference and theory of mind related to writing? Are they related
to writing, and if so, are their relations direct or primarily mediated
via discourse-level oral language? Although developmental models
of writing did not explicitly specify the roles of higher-order
cognitive skills in writing and novice learners tend to rely on
less-sophisticated knowledge-telling strategies (Bereiter &
Scardamalia, 1987), successful writing, even for beginning writers,
might draw on higher-order cognitive skills such as reasoning and
perspective taking (e.g., writing for audience; also called
metacognitive control, see McCutchen, 1988). In coherent written
compositions, ideas within the text are tightly connected with each other
and presented in a logical fashion. This would require a writer’s
reasoning and inferencing skill. Likewise, good writers develop an
understanding about the needs of their audience (Englert, Raphael,
Anderson, Anthony, & Stevens, 1991) and modulate language
accordingly (McCutchen, 1988). Even young children showed
planning for a specific audience by adapting oral text production
to the audience’s needs (e.g., Cameron & Wang, 1999; De
Temple, Wu, & Snow, 1991; Littleton, 1998; McCutchen, 1988).
Therefore, it is reasonable to speculate that a higher-order cognitive
skill, perspective taking as measured by theory of mind tasks,
would relate to writing. Theory of mind refers to one’s knowledge
of the mental status of others (thoughts and emotions) and perspective
taking, and is typically assessed by false belief tasks (see
Astington & Jenkins, 1999; de Villiers, 2000; Howlin, Baron-Cohen,
& Hadwin, 1999; Norbury, 2005). In a typical false belief
task, the child listens to a series of events and connects the events
to infer characters’ cognitive statuses; the task thus requires an
understanding of different perspectives (Comay, 2009; Kim, 2015; Kim
& Phillips, 2014).

In order to investigate the nature of language and cognitive
component skills and their relations to writing, we evaluated four
different variations of the direct and indirect effects models of
writing (DIEW). The DIEW model is built on the extant developmental
models, such as the simple view and not-so-simple view of
writing, but extends them by explicitly hypothesizing direct and
indirect relations among component skills and their relations to
writing based on theory and empirical evidence. Prior to fitting the
DIEW model, we first examined the relations of language and
cognitive skills to discourse-level oral language (see Figure 1). As
noted above, working memory was hypothesized to be a foundational
cognitive ability necessary for any learning tasks including
vocabulary and grammatical knowledge (see Figure 4). We then
investigated the relations of higher-order cognitive skills to writing
after accounting for transcription skills (spelling and handwriting)
and working memory (see Figure 2). Finally, four alternative
models of DIEW (Figures 3a, 3b, 3c, and 3d) were fitted and
compared. In the first model (Figure 3a, a complete mediation
model), discourse-level oral language and transcription skills
(spelling and handwriting fluency) were hypothesized to completely
mediate the relations of oral language and cognitive component
skills to writing. Discourse-level oral language was hypothesized
to be directly predicted by higher-order cognitive skills
(inference and theory of mind), and directly and indirectly predicted
by foundational oral language skills (vocabulary and grammatical
knowledge) and the foundational cognitive skill (working
memory). In the alternative partial mediation models, vocabulary
and grammatical knowledge (Figure 3b) and higher-order skills
(Figure 3c) were, respectively, hypothesized to have direct relations
to writing over and above discourse-level oral language and
transcription skills.

Figure 2. The relations of inference, theory of mind (ToM), spelling, and sentence copying fluency to writing.

The final DIEW model (Figure 3d) tested whether working
memory is directly related to writing after accounting for its
contribution to all other component skills. As writing requires
coordinating multiple processes such as generating ideas and transcribing
those ideas into written products, writing places a great
demand on working memory (Kellogg, 1996, 2008; Kellogg, Olive,
& Piolat, 2007; McCutchen, 2006). Working memory is
necessary to support transcription processes (Berninger et al.,
2010), particularly when transcription is not automatic (McCutchen,
1996). Fluent transcription skills would allow working
memory to be available for higher-level cognitive processes, such
as planning and revising (McCutchen, 2006) and text generation
and linguistic encoding (Bereiter & Scardamalia, 1987; Hayes &
Chenoweth, 2007; Kellogg, 1996). Furthermore, working memory
has been shown to be critical to vocabulary development
(Gathercole & Baddeley, 1990a, 1990b, 1993; Gathercole, Service,
Hitch, Adams, & Martin, 1999), grammatical knowledge (Kim,
2015, 2016), higher-order cognitive skills (Carlson, Moses, &
Breton, 2002; Kim, 2015, 2016; Slade & Ruffman, 2005), and
discourse-level oral language (Kim, 2015, 2016; Strasser & del
Rio, 2014). Taken together, these studies suggest that working
memory is a foundational cognitive capacity for transcription as
well as text generation processes. In order to explicitly test the
pathway of influence of working memory to writing, a direct path from
working memory to writing was tested after accounting for all the
other language and cognitive component skills.
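
As a rough illustration of how a direct path and mediated paths combine, the sketch below computes a total effect in a recursive (acyclic) path model by summing the products of coefficients along every directed route from a predictor to writing. The coefficients, variable names, and helper functions are hypothetical placeholders written for illustration; they are not estimates or procedures from this study.

```python
from collections import defaultdict

# Hypothetical standardized path coefficients (NOT the study's estimates).
# Each key is (predictor, outcome) in a recursive (acyclic) path model.
PATHS = {
    ("working_memory", "vocabulary"): 0.40,
    ("working_memory", "spelling"): 0.30,
    ("working_memory", "oral_language"): 0.20,
    ("vocabulary", "oral_language"): 0.50,
    ("oral_language", "writing"): 0.45,
    ("spelling", "writing"): 0.35,
}


def total_effect(paths, source, target):
    """Sum the products of coefficients over every directed route source -> target."""
    children = defaultdict(list)
    for (pred, out), coef in paths.items():
        children[pred].append((out, coef))

    def walk(node):
        # Reaching the target closes a route; its product so far is kept by the caller.
        if node == target:
            return 1.0
        return sum(coef * walk(nxt) for nxt, coef in children[node])

    return walk(source)


def direct_effect(paths, source, target):
    """Coefficient on the single-step path, or 0.0 if the model has no such path."""
    return paths.get((source, target), 0.0)


total = total_effect(PATHS, "working_memory", "writing")
direct = direct_effect(PATHS, "working_memory", "writing")
indirect = total - direct
print(f"total={total:.3f} direct={direct:.3f} indirect={indirect:.3f}")
```

With these made-up values, working memory has no direct path to writing, so its total effect is entirely indirect, which is the pattern a complete mediation model assumes. Adding a `("working_memory", "writing")` entry would correspond to a model with a direct path, as in the final DIEW variation (Figure 3d).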

Method 

Participants 

A total of 193 children in Grade 1 from 41 classrooms in nine
schools (50% boys; mean age = 6.68; SD = .48) in the southeastern
region of the United States participated in the study. Children
with identified intellectual disabilities were excluded from the
study, and there were no other selection criteria. The sample
reflects consented children from each class and was composed of
approximately 43% Caucasians, 34% African Americans, 6% Hispanics,
6% Asian Americans, and 7% mixed race. Approximately
6% were designated as English language learners and 29% were
eligible for free and reduced lunch. The school districts’ records
indicated that 1% of these children had language impairment, 2%
had speech impairment, and 2% had multiple learning disabilities.
The participating schools used explicit instruction on reading using
Imagine it! (Bereiter, 2010), but no formal district-wide curriculum
was used in writing.

Measures 

Reliability  estimates  for  the  included  tasks  are  reported  in  Table 
1,  and  most  were  in  the  acceptable  to  excellent  range.  Unless 
otherwise  noted,  children’s  responses  were  scored  dichotomously 
(1 = correct; 0 = incorrect) for each item, and all the items were
administered  to  the  child. 

Writing. Children were administered two prompts from previous
studies (Kim et al., 2014, 2015; Kim, Al Otaiba, Sidler, &
Greulich, 2013; McMaster, Du, & Petursdottir, 2009; McMaster
et al., 2011). In the first writing task, the children were asked to
write about a time something unusual or interesting happened
when they got home from school. Children were provided with the
prompt “One day when I got home from school . . .” on the ruled
writing paper (One day hereafter). This task was significantly and
moderately related to other standardized and normed writing tasks
such as the Wechsler Individual Achievement Test Essay Composition
task and the Woodcock-Johnson Writing Fluency task
(Kim, Al Otaiba et al., 2015). In the second prompt, the children
were provided with the beginning of a story about a child who
discovers a castle that appeared overnight. They were then told to
write a story about who the child met and what happened inside the
castle (Castle hereafter). Children were given 15 min for each
prompt.

Figure 3. Four alternative models of the direct and indirect effects of developmental writing (DIEW). Black
lines represent predictive paths and gray lines represent covariances. Oral lang = Oral language; ToM = Theory
of mind; Grammar = Grammatical knowledge.

Children’s written compositions were scored for writing quality,
using a modified version of the 6 + 1 trait rubric. Writing quality
is typically operationalized as the extent and clarity of idea development
and organization (e.g., Graham, Berninger, & Fan, 2007;
Graham, Harris, & Chorzempa, 2002; Graham, Harris, & Mason,
2005; Hooper, Swartz, Wakely, de Kruif, & Montgomery, 2002;
Kim, Al Otaiba, et al., 2015; Olinghouse, 2008), and a recent study
has shown that four of the 6 + 1 traits (i.e., idea development,
organization, sentence fluency, and word choice) capture a single
dimension (Kim et al., 2014). In the present study, the extent of
idea development was scored on a scale of 1 to 5 (see Appendix
A), similar to a previous study (Kim et al., 2014). Compositions
with detailed and rich ideas were rated higher than those with
lower quality idea development. Interrater reliabilities (Cohen’s
kappa) were established with 45 written compositions for each
prompt (a total of 90) and were .73 for the One day prompt and .82
for the Castle prompt.
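
For readers unfamiliar with the statistic, the following is a minimal sketch of unweighted Cohen's kappa, which corrects the observed agreement between two raters for the agreement expected by chance. The source does not say whether a weighted variant was used, so the unweighted form is an assumption here, and the ratings are invented for illustration.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two raters scoring the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)
    # Proportion of items on which the two raters gave the same score.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal category frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical category throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical 1-5 quality ratings for ten compositions (not study data).
rater_1 = [3, 2, 4, 3, 1, 5, 2, 3, 4, 2]
rater_2 = [3, 2, 4, 2, 1, 5, 2, 3, 3, 2]
kappa = cohens_kappa(rater_1, rater_2)
print(round(kappa, 2))
```

In this invented example the raters agree on 8 of 10 compositions, and chance agreement is .25, so kappa is (.80 - .25) / (1 - .25).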

Figure 4. Standardized path coefficients of higher order cognitive skills (inference and theory of mind),
foundational language skills (vocabulary, grammatical knowledge), and working memory to discourse level oral
language production. Solid lines represent statistically significant relations whereas dashed lines represent
nonsignificant relations. Gray lines represent covariances. TNL = test of narrative language; Oral Lang = oral
language; ToM = theory of mind; Grammar = Grammatical knowledge.

Working memory. The listening span task (Florit et al., 2009;
Kim, 2015, 2016) was used. The children were presented with a
sentence and asked to identify whether the heard sentence was
correct or not. After hearing the sentences, they were asked to recall
the last words in the sentences. All the sentences involved common
knowledge familiar to children (e.g., pigs can fly). Testing was
discontinued after three consecutive incorrect responses. There
were four practice items and 14 test items. Children’s yes/no
responses regarding the veracity of the statement were not scored,
but their responses on the last words in correct order were given a
score of 2, and correct responses in incorrect order were given a
score of 1. Therefore, the total possible maximum score was 28.
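
The 0/1/2 item scoring for the listening span task can be sketched as follows. The function names and the example words are hypothetical, and treating any response that does not contain exactly the target words as 0 is an assumption, because the source does not describe partial recall. The discontinue rule is not modeled here.

```python
def score_item(recalled, targets):
    """2 = all last words in the correct order; 1 = all words, wrong order; 0 = otherwise."""
    recalled = [w.strip().lower() for w in recalled]
    targets = [w.strip().lower() for w in targets]
    if recalled == targets:
        return 2
    if sorted(recalled) == sorted(targets):
        return 1
    return 0  # assumption: anything else earns no credit


def score_task(responses):
    """Sum item scores over (recalled, targets) pairs, one pair per test item."""
    return sum(score_item(recalled, targets) for recalled, targets in responses)


# Hypothetical item: the last words of two heard sentences were "fly" and "bark".
print(score_item(["fly", "bark"], ["fly", "bark"]))
print(score_item(["bark", "fly"], ["fly", "bark"]))
```

With 14 test items each worth at most 2 points, a perfect protocol sums to the stated maximum of 28.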

Spelling. An experimental dictation task was developed, piloted,
and used in order to capture the ability to spell words that are
relevant to children in Grade 1 (e.g., consonant-vowel-consonant
[CVC], CVCe words, vowel digraphs). In this task, the children
were asked to spell target words accurately. Target words were
presented in isolation, in a sentence, and in isolation again. There
were a total of 20 items.

Table 1

Reliability and Descriptive Statistics

Variable                    Reliability   Mean (SD)        Min-Max   Skewness   Kurtosis
Age                         NA            6.68 (.48)       6-8.11    -.25       -1.16
Working memory              .74           13.92 (5.64)     0-24      -.59       -.20
EVT                         .94           94.16 (14.68)    58-149    -.15       .56
EVT_SS                      NA            104.26 (12.77)   72-150    -.07       .35
Grammatical knowledge       .90           26.51 (6.57)     2-36      -1.40      2.39
Inference                   .89           14.09 (6.07)     0-25      -.55       -.46
Theory of mind              .79           9.21 (3.39)      2-16      .04        -.72
TNL retell                  .87+          30.22 (10.16)    0-51      -.30       -.13
Expository retell           .88+          9.63 (6.11)      0-27      .53        -.4
Spelling                    .90           8.86 (4.91)      0-20      .28        -.52
Sentence copying            .90*          34.21 (15.59)    3-105     1.11       2.79
Writing quality: One day    .73+          2.66 (.96)       0-5       -.39       .03
Writing quality: Castle     .82+          2.45 (.99)       0-5       .36        .18

Note. SS = standard score. Reliabilities are Cronbach’s alpha except + (Cohen’s kappa) and * (exact percent
agreement).

Handwriting fluency. Children were asked to accurately
copy a sentence, The quick brown fox jumps over the lazy dog, as
many times as possible in 1 min. This sentence is a pangram, which
includes every letter of the English alphabet at least once, and has
been used as a measure of handwriting fluency (e.g., Connelly,
Gee, & Walsh, 2007; Wagner et al., 2011; Zhang, McBride-Chang,
Wagner, & Chan, 2014) and was related to writing quality (Wagner
et al., 2011; Zhang et al., 2014). Children’s responses were
scored by counting the number of letters copied correctly.
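
The pangram property of the copying sentence, and the number of letters in one full copy, can be checked with a short sketch. The function names are hypothetical, and the sketch does not attempt to reproduce how individual letters were judged correct, which the source does not specify.

```python
import string

SENTENCE = "The quick brown fox jumps over the lazy dog"


def is_pangram(text):
    """True if every letter a-z appears at least once (case-insensitive)."""
    return set(string.ascii_lowercase) <= set(text.lower())


def letters_per_copy(text):
    """Number of alphabetic characters in one full copy of the sentence."""
    return sum(ch.isalpha() for ch in text)


print(is_pangram(SENTENCE), letters_per_copy(SENTENCE))
```

One full copy contains 35 letters, so the sample mean of 34.21 letters reported for sentence copying in Table 1 corresponds to roughly one complete copy per minute.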

Vocabulary. The Expressive Vocabulary Test-2nd edition
(Williams, 2007) was used. The children were asked to identify
pictured objects or provide synonyms. Test administration discontinued
after six consecutive incorrect items.

Grammatical knowledge. The grammaticality judgment task
of the Comprehensive Assessment of Spoken Language (CASL;
Carrow-Woolfolk, 1999) was used. This task is normed for children
in Grades 2 and above, and therefore a few easy items were
developed modeling the items in the CASL. These items were then
piloted and used as the first few items. In other words, the items in
this task included a few experimental items as well as items in the
grammaticality judgment task of CASL. Children’s performance
on the grammaticality judgment task was related to syntax construction
(r = .66) and grammatical morphemes (r = .66; Carrow-Woolfolk,
1999). In this task, the children heard a sentence (e.g.,
The children are run) and were asked whether the sentence was
grammatically correct. If grammatically incorrect, the child was
asked to correct the sentence. There were three practice items and
20 test items. Test administration discontinued after five consecutive
incorrect items. Of the 20 test items, 17 items included
grammatically incorrect sentences (see the example above), and
for these items, a total of 2 points were possible (1 for identifying
grammatical inaccuracy, and 1 for accurately correcting the sentence).
Therefore, the total possible maximum in the grammatical
knowledge task was 36.

Inference. The inference task of CASL (Carrow-Woolfolk,
1999) was used. Similar to the grammaticality judgment task
described above, this task is normed for children in Grade 2 and
above, and therefore, several easy items were developed, piloted,
and used in the first few test items. In this task, the children were
asked to infer information from heard sentences based on their
background knowledge. They heard two- to three-sentence stories,
and were asked a question that required inference based on background
knowledge. For instance, the children heard “Mother
called to four-year-old Sandra and says ‘Be sure to bring your
bathing suit. And don’t forget your shovel and bucket.’ Where are
they going?” The correct responses include “to the beach” or “to
go swimming” or something similar. There were two practice
items and 25 test items. Test administration discontinued after five
consecutive incorrect items. Performance on the inference task was
reported to be strongly related to the nonliteral language task (r =
.73; Carrow-Woolfolk, 1999).
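
A discontinue (ceiling) rule of the kind described for this task can be sketched as below. The function name and its `ceiling` parameter are hypothetical, and counting items answered before the stopping point as the raw score is an assumption about how dichotomous items were summed.

```python
def raw_score(item_results, ceiling=5):
    """Count correct items until `ceiling` consecutive incorrect items occur."""
    score, run = 0, 0
    for correct in item_results:
        if correct:
            score += 1
            run = 0  # a correct answer resets the run of errors
        else:
            run += 1
            if run >= ceiling:
                break  # testing stops; later items are never administered
    return score


# Hypothetical protocol: two correct items, then five consecutive errors.
protocol = [True, True] + [False] * 5 + [True] * 18
print(raw_score(protocol))
```

The same function covers the other tasks by changing `ceiling`, for example 6 for the vocabulary test as described above.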

Theory  of  mind.  One  first-order  false  belief  scenario  and  two 
second-order  false  belief  scenarios  were  used  (Kim,  2015;  Kim  & 
Phillips,  2014).  The  first-order  task  examines  the  child’s  ability  to 
infer  a  story  character’s  mistaken  belief  whereas  the  second-order 
task examines the child’s ability to infer a story character’s mistaken
belief about another character’s knowledge (see Caillies &
Le Sourn-Bissaoui, 2008, for further details). The first-order false
belief  task  involved  the  location  of  a  basketball  in  school,  and  the 
other  two  second-order  tasks  involved  the  context  of  a  bake  sale 
and  going  out  for  a  birthday  celebration.  The  assessor  presented 
stories  to  the  children  using  a  series  of  illustrations,  followed  by 
the  assessor’s  questions.  There  were  a  total  of  16  questions. 


Discourse-level  oral  language.  The  Test  of  Narrative  Lan¬ 
guage  (TNL;  Gillam  &  Pearson,  2004)  and  an  experimental  ex¬ 
pository  task  were  used.  In  the  TNL  test,  only  Story  1  (Task  1)  has 
a  retell  task.  However,  in  the  present  study,  we  adapted  the  TNL 
test  so  that  the  children  were  asked  to  retell  three  narrative  stories 
(Tasks  1,  3  and  5)  after  they  heard  each  story.  The  experimental 
expository  task  was  composed  of  three  expository  passages  (85 
words,  76  words,  and  140  words,  respectively)  from  the  Qualita¬ 
tive  Reading  Inventory-5  passages  (Leslie  &  Caldwell,  2011). 
Titles of the passages were Air, The brain and the five senses, and
Changing matter. After listening to each passage, the children
were asked to retell it.

Children’s  retell  was  recorded  using  a  digital  recorder,  Olympus 
VN  8100  pc,  and  was  transcribed  verbatim  following  Systematic 
Analysis  of  Language  Transcription  (SALT;  Miller  &  Iglesias, 
2006)  guidelines.  Children’s  retell  quality  was  evaluated  using 
transcribed  data.  Narrative  retell  quality  was  determined  by  the 
extent  to  which  key  narrative  elements  (e.g.,  main  characters, 
setting,  events,  problem,  and  resolution)  and  key  details  were 
included  (e.g.,  Barnes,  Kim,  &  Phillips,  2014;  Scott  &  Windsor, 
2000).  Narrative  quality  using  this  approach  was  moderately  re¬ 
lated  to  discourse  comprehension  (Barnes  et  al.,  2014;  Scott  & 
Windsor,  2000).  Children’s  performance  on  each  element  was 
rated  on  a  scale  of  0-3,  with  the  exception  of  the  resolution 
element  for  Task  1,  which  was  on  a  scale  of  0-2.  The  children 
received  0  for  no  inclusion  of  the  story  elements,  1  for  a  partially 
correct  or  implicitly  stated  element,  2  for  a  correct  but  imprecise 
statement,  and  3  for  a  precise  statement.  For  expository  retell,  the 
number  of  a  priori  identified  key  details  (each  worth  a  point)  was 
counted.  Interrater  reliability  was  estimated  using  40  transcripts 
and  Cohen’s  kappa  (see  Table  1). 
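
For readers unfamiliar with the agreement statistic, the following is a minimal sketch of unweighted Cohen's kappa for two raters. The ratings below are invented for illustration and are not the study's data; the function name is hypothetical, and the study may have used a different (e.g., weighted) variant.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two equal-length lists of category labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)
    # Proportion of items on which the two raters agree.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance from each rater's marginal distribution.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / n ** 2
    if expected == 1.0:
        return 1.0  # both raters used a single identical category
    return (observed - expected) / (1 - expected)


# Invented 0-3 ratings of one narrative element for ten transcripts.
rater_a = [0, 1, 2, 3, 3, 2, 1, 0, 2, 3]
rater_b = [0, 1, 2, 3, 2, 2, 1, 0, 2, 3]
print(round(cohens_kappa(rater_a, rater_b), 2))
```

Kappa corrects raw percent agreement for the agreement two raters would reach by chance, which matters for a short 0-3 scale where chance agreement is not negligible.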

Procedures 

Children were assessed by rigorously trained research assistants
in a quiet space in the school. The assessment battery was
administered in several sessions, each lasting approximately 30 to
40 min. Writing, spelling, and handwriting fluency tasks were
administered in a group setting (3-4 children), and the other tasks
were individually administered.

Data  Analysis  Strategy 

Confirmatory factor analysis and structural equation modeling
(SEM) were the primary data analytic strategies, using Mplus 7.1
(Muthén & Muthén, 2013). Latent variables were created for
writing  and  discourse-level  oral  language.  The  language  (e.g., 
vocabulary),  cognitive  skills  (e.g.,  inference),  and  transcription 
skills  were  assessed  by  single  measures  for  each  construct,  and 
therefore  observed  variables  were  used.  Model  fits  were  evaluated 
by  the  following  indices:  chi-square  statistics,  comparative  fit 
index  (CFI),  the  Tucker-Lewis  index  (TLI),  root  mean  square  error 
of  approximation  (RMSEA),  and  standardized  root  mean  square 
residuals  (SRMR).  Excellent  model  fits  include  RMSEA  values 
below  .08,  CFI  and  TLI  values  equal  to  or  greater  than  .95,  and 
SRMR  equal  to  or  less  than  .05  (Hu  &  Bentler,  1999).  TLI  and  CFI 
values  greater  than  .90  are  considered  acceptable  (Kline,  2005). 
Model  fits  were  compared  using  chi-square  differences  for  nested 
models. 
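
The cutoffs listed above can be applied mechanically, as in the sketch below. It assumes the thresholds exactly as stated in this section; the function name and the "poor" label for everything else are hypothetical conveniences, not terms used in the study.

```python
def classify_fit(cfi, tli, rmsea, srmr):
    """Label model fit using the cutoffs stated in the text.

    Excellent: RMSEA < .08, CFI and TLI >= .95, SRMR <= .05.
    Acceptable: CFI and TLI > .90.
    Anything else is labeled "poor" (a label assumed here).
    """
    if rmsea < 0.08 and cfi >= 0.95 and tli >= 0.95 and srmr <= 0.05:
        return "excellent"
    if cfi > 0.90 and tli > 0.90:
        return "acceptable"
    return "poor"


# Fit indices reported later for the model in Figure 1.
print(classify_fit(cfi=1.00, tli=0.98, rmsea=0.048, srmr=0.016))
# Fit indices reported later for the partial mediation model in Figure 3b.
print(classify_fit(cfi=0.98, tli=0.94, rmsea=0.067, srmr=0.031))
```

The chi-square statistic is left out of the rule because it is sensitive to sample size and is reported alongside, rather than instead of, the descriptive indices.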
Table 2

Bivariate Correlations Between Measures

Measure                          1     2     3     4     5     6     7     8     9    10
 1. Working memory             1.00
 2. Vocabulary: EVT             .40  1.00
 3. Grammar                     .40   .64  1.00
 4. Inference                   .41   .66   .61  1.00
 5. Theory of mind              .42   .52   .53   .61  1.00
 6. TNL retell                  .34   .50   .46   .53   .48  1.00
 7. Expository retell           .42   .47   .41   .43   .47   .55  1.00
 8. Spelling                    .39   .46   .39   .30   .21   .30   .37  1.00
 9. Sentence copying            .24   .31   .30   .31   .26   .32   .27   .56  1.00
10. Writing quality: One day    .33   .35   .35   .40   .31   .25   .35   .51   .42  1.00
11. Writing quality: Castle     .30   .42   .36   .32   .24   .37   .43   .44   .37   .49

Note. All coefficients are statistically significant at .05.


Results 

Descriptive  Statistics 

Table  1  displays  descriptive  statistics.  Children’s  mean  perfor¬ 
mance  on  the  normed  and  standardized  task,  vocabulary,  was  in 
the  average  range.  In  the  other  experimental  measures,  there  was 
sufficient  variation  around  the  means,  and  skewness  and  kurtosis 
values  were  in  the  accepted  range.  Subsequent  analysis  was  con¬ 
ducted  using  raw  scores. 

Table 2 shows bivariate correlations between measures. All the
tasks were somewhat weakly to moderately related to writing
measures (.25 ≤ rs ≤ .51). Working memory was also weakly to
moderately related to all other skills (.24 ≤ rs ≤ .42). Correlations
between other measures were in the expected range and direction.
Multivariate normality was tested using Henze-Zirkler’s multivariate
normality test (Henze & Zirkler, 1990), and results indicated that the
multivariate normality assumption was met (HZ = .995, p = .14).

Direct  and  Mediated  Model  of 
Discourse-Level  Language 

The model shown in Figure 1 fit the data very well, χ²(4) =
5.67, p = .23, CFI = 1.00, TLI = .98, RMSEA = .048, SRMR =
.016. As shown in Figure 4, theory of mind (β = .27, p = .003),
vocabulary (γ = .26, p = .007), and working memory (γ = .18,
p = .02) were directly related to discourse-level oral language,
whereas inference (β = .19, p = .058) and grammatical knowledge
(γ = .08, p = .39) were not. Inference and theory of mind
were predicted by vocabulary, grammatical knowledge, and working
memory (ps < .04). Approximately 61% of total variance in
discourse-level oral language was explained by the included
language and cognitive skills.

The  Relations  of  Higher-Order  Cognitive  Skills  to 
Writing  Quality 

In order to examine the relation of higher-order cognitive skills
(inference and theory of mind) to writing, the model shown in
Figure 2 was fitted to the data. Model fit was excellent, χ²(17) =
28.76, p = .04, CFI = .99, TLI = .98, RMSEA = .06, SRMR =
.017. As shown in Figure 5, inference (β = .27, p = .003) was
independently related to writing, whereas theory of mind (β = .08,
p = .36) was not, after accounting for spelling, handwriting
fluency, and working memory. A total of 59% of variance in
writing was explained.

Testing  the  DIEW  Models 

Four  alternative  DIEW  models  shown  in  Figures  3a  to  3d  were 
tested.  In  all  these  models,  covariances  were  allowed  between 
component  skills  (e.g.,  vocabulary  and  grammar;  vocabulary  and 
spelling).  Exceptions  were  between  higher-order  cognitive  skills 
(inference  and  theory  of  mind)  and  transcription  skills  because  of 
nonsignificance  in  preliminary  analysis. 

The complete mediation model (Figure 3a) fit the data well,
χ²(24) = 41.33, p = .02, CFI = .98, TLI = .95, RMSEA = .062
(.027–.093), SRMR = .031. The partial mediation models also had
good fit to the data: χ²(22) = 40.74, p = .0088, CFI = .98, TLI =
.94, RMSEA = .067 (.033–.10), SRMR = .031 for the model in
Figure 3b; χ²(22) = 39.37, p = .01, CFI = .98, TLI = .95,
RMSEA = .065 (.030–.097), SRMR = .030 for the model in
Figure 3c; and χ²(23) = 41.21, p = .011, CFI = .98, TLI = .95,
RMSEA = .065 (.031–.096), SRMR = .031 for the model in
Figure 3d. Chi-square difference tests showed no differences
between these models (0.12 ≤ Δχ² ≤ 1.96; 1 ≤ Δdf ≤ 2, .16 ≤ p ≤
.73). Furthermore, in the partial mediation models, the direct paths
from the component language and cognitive skills to writing were
all nonsignificant (ps > .19; see Appendix B). Therefore,
based on parsimony and the chi-square test results, the
complete mediation model (Figure 3a) was chosen as the final
model.
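
A chi-square difference test for nested models can be sketched as follows. This is an illustration only: it assumes ordinary (unscaled) chi-square values, so it will not reproduce results obtained with a scaling correction, and the helper names are hypothetical. The p value is computed with the closed-form chi-square tail for integer degrees of freedom so that only the standard library is needed.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variate with integer df."""
    if df < 1 or x < 0:
        raise ValueError("need df >= 1 and x >= 0")
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{i < df/2} (x/2)^i / i!
        term, total = 1.0, 1.0
        for i in range(1, df // 2):
            term *= half / i
            total += term
        return math.exp(-half) * total
    # Odd df: normal tail plus a finite correction series.
    total = math.erfc(math.sqrt(half))
    pdf = math.exp(-half) / math.sqrt(2.0 * math.pi)
    term = math.sqrt(x)
    for r in range(1, (df - 1) // 2 + 1):
        if r > 1:
            term *= x / (2 * r - 1)
        total += 2.0 * pdf * term
    return total


def chi2_difference_test(chi2_restricted, df_restricted, chi2_full, df_full):
    """Return (delta chi-square, delta df, p) for two nested models."""
    delta = chi2_restricted - chi2_full
    delta_df = df_restricted - df_full
    return delta, delta_df, chi2_sf(delta, delta_df)


# Complete mediation (Figure 3a) versus the model in Figure 3d.
delta, delta_df, p = chi2_difference_test(41.33, 24, 41.21, 23)
print(round(delta, 2), delta_df, round(p, 2))
```

A nonsignificant difference means the extra paths in the less restricted model do not improve fit reliably, which is the basis for preferring the more parsimonious model.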

Figure 6 displays standardized path coefficients of the complete
mediation model. Discourse-level oral language (β = .46, p <
.001), spelling (β = .37, p < .001), and handwriting fluency (β =
.17, p = .047) were all directly related to writing quality.
Discourse-level oral language was directly predicted by the two
higher-order cognitive skills, inference (β = .21, p = .035) and
theory of mind (β = .26, p = .003). Vocabulary (β = .42, p <
.001) and working memory (β = .19, p = .012) were also directly
related to discourse-level oral language after accounting for all the
other variables in the model. Inference and theory of mind were
predicted by vocabulary (βs = .42 and .26, ps < .001), grammatical
knowledge (βs = .29 and .28, ps < .001), and working
memory (γs = .12 and .20, ps < .04). Vocabulary and grammatical
knowledge were predicted by working memory (γs = .40 and .40,
ps < .001). Working memory also predicted spelling (γ = .39, p <
.001) and handwriting fluency (γ = .23, p = .001). A total of 67%
of variance in writing and 62% of variance in discourse-level oral
language were explained.

Figure 5. Standardized path coefficients of higher order cognitive skills (inference and theory of mind) and
transcription skills (spelling and sentence copying fluency) to writing. Solid lines represent statistically significant
relations whereas dashed lines represent nonsignificant relations. Gray lines represent covariances. ToM = theory of
mind.

Table  3  displays  direct,  indirect,  and  total  effects  of  the  com¬ 
ponent  skills.  The  largest  effects  were  found  for  discourse-level 
oral  language  (.46),  working  memory  (.43),  and  spelling  (.37); 
followed  by  vocabulary  (.19),  handwriting  (.17),  theory  of  mind 
(.12),  inference  (.10),  and  grammatical  knowledge  (.10). 
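
As a worked illustration of how an indirect effect follows from path coefficients, the sketch below multiplies the standardized paths along a single route to writing. It uses only the inference and theory of mind paths reported for Figure 6; the function name is hypothetical. Predictors that reach writing through several routes (working memory, for example) would require summing the products over all routes, which is not attempted here.

```python
def indirect_effect(*paths):
    """Product of standardized path coefficients along one route."""
    product = 1.0
    for coefficient in paths:
        product *= coefficient
    return product


# Standardized coefficients reported in the text for the final model.
oral_language_to_writing = 0.46
inference_to_oral_language = 0.21
theory_of_mind_to_oral_language = 0.26

inference_indirect = indirect_effect(
    inference_to_oral_language, oral_language_to_writing
)
theory_of_mind_indirect = indirect_effect(
    theory_of_mind_to_oral_language, oral_language_to_writing
)

# Rounded to two decimals, these agree with the indirect effects
# listed for inference and theory of mind in Table 3.
print(round(inference_indirect, 2), round(theory_of_mind_indirect, 2))
```

Because neither skill has a direct path to writing in the complete mediation model, its total effect equals this single indirect effect.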

Discussion 

The  primary  aim  of  the  study  was  to  examine  direct  and  indirect 
relations  of  language  and  cognitive  component  skills  to  writing. 
Based  on  the  simple  view  of  writing  and  not-so-simple  view,  we 
hypothesized  that  text  generation  and  transcription  are  necessary 
for  writing  development.  Furthermore,  we  specified  component 
skills  of  discourse-level  oral  language  based  on  growing  evidence, 
and  examined  the  nature  of  their  relations  to  writing. 

The  direct  and  mediated  model  of  discourse-level  language  fit 
the  data  very  well,  such  that  foundational  language  and  cognitive 
skills  and  higher-order  cognitive  skills  were  directly  and  indirectly 
related  to  discourse-level  oral  language.  Although  inference  did 
not  quite  reach  the  conventional  statistical  significance  (p  =  .058), 
the  overall  structure  of  relations  found  in  the  present  study  is  in 
line  with  previous  studies  (Kim,  2015,  2016).  These  results  indi¬ 
cate  that  the  discourse-level  oral  language  is  an  upper-level  skill, 
predicted  by  not  only  the  ability  to  use  vocabulary  and  to  combine 
words  to  represent  meanings  (grammatical  knowledge),  but  also  by 
higher-order cognitive skills to connect propositions, and to understand
others’ thoughts and take perspectives (Florit, Roch,
Altoe, & Levorato, 2009; Florit et al., 2014; Kendeou et al., 2008;
Lepola et al., 2012; Strasser & del Rio, 2014; Tompkins, Guo, &
Justice, 2013). Furthermore, higher-order cognitive skills are predicted
by foundational language and cognitive skills, convergent
with previous studies (Carlson, Moses, & Breton, 2002; Kim,
2015, 2016; Kim & Phillips, 2014; Slade & Ruffman, 2005). It is
worth  noting  that  previous  investigations  of  component  skills  of 
discourse  language  involved  “comprehension,”  whereas  in  the 
present  study  we  expanded  it  to  discourse  language  “generation” 
or  “production.”  Convergent  results  for  comprehension  and  pro¬ 
duction  are  in  line  with  the  direct  and  mediated  model  in  Figure  1 
and  the  construction-integration  model  (Kintsch,  1988),  as  both  of 
these  models  incorporate  comprehension  and  production  at  the 
discourse  level. 

When  it  comes  to  the  direct  and  indirect  relations  model  of 
writing  (DIEW),  a  complete  mediation  model  described  the  data 
best.  Discourse-level  oral  language  and  transcription  skills  (spell¬ 
ing  and  handwriting  fluency)  had  direct  relations  to  writing.  In 
contrast,  all  the  other  language  and  cognitive  component  skills 
were  indirectly  related  to  writing  via  discourse-level  oral  language 
and  transcription  skills.  Moreover,  discourse-level  oral  language 
had  a  substantial — and  in  fact,  the  largest — direct  effect  on  writing 
(.46).  Transcription  skills  also  had  sizable  effects  on  writing  (.37 
for  spelling  and  .17  for  handwriting  fluency). 

Working  memory  was  found  to  be  a  foundational  cognitive 
capacity  for  component  language  and  cognitive  skills.  It  was 
directly  related  to  foundational  oral  language  skills  (vocabulary 
and  grammatical  knowledge),  higher-order  cognitive  skills  (infer¬ 
ence  and  theory  of  mind),  and  transcription  skills  (spelling  and 
handwriting  fluency).  Furthermore,  it  appears  that  working  mem¬ 
ory  constrains  discourse-level  oral  language  even  after  accounting 
for  the  effects  of  foundational  oral  language  and  higher-order 
cognitive  skills.  Producing  coherent  oral  text  at  the  discourse  level 
places  a  great  demand  on  working  memory,  as  the  interlocutor  has 
to  temporarily  hold  propositions  and  ideas  while  simultaneously 
generating and interconnecting ideas for flow and logic. Importantly,
however, working memory was no longer directly related to
writing once all the language and cognitive skills were accounted
for. Despite its indirect nature, though, the total effect of working
memory on writing was substantial (.43), suggesting that working
memory is one of the key cognitive abilities that underpin writing
skill.

Figure 6. Standardized path coefficients showing the relations of oral language and cognitive component skills
of discourse-level oral language, discourse-level oral language, spelling, and sentence copying fluency to
writing. Solid lines represent statistically significant relations whereas dashed lines represent nonsignificant
relations. Gray lines represent covariances. TNL = test of narrative language; Exp = Expository texts; Oral
Lang = oral language; ToM = theory of mind; Grammar = Grammatical knowledge.

The  present  findings  also  revealed  that  a  higher-order  cognitive 
skill,  inference,  was  independently  related  to  writing,  after  accounting 
for  theory  of  mind  and  transcription  skills,  suggesting  that  children’s 
ability  to  connect  ideas  and  propositions  to  background  knowledge  is 
important to writing quality. As stated above, interconnecting
propositions and ideas is important for establishing global coherence
across the text. Although novice writers may not exhibit the
sophisticated writing strategies found in expert writers (e.g., elaborated
planning or revising), children’s inferencing ability appears to be
important to writing quality. However, these results do not negate the
importance of theory of mind to writing, as it appears that the effect
of theory of mind on writing is largely indirect, shared with inference
(see the fairly strong bivariate correlation, r = .61).

Table 3

Direct, Indirect, and Total Effects of Language and Cognitive
Skills (Standard Error) on Writing Based on the Results in
Figure 6

Variable                         Direct effect   Indirect effect   Total effect
Discourse-level oral language    .46 (.09)       –                 .46
Spelling                         .37 (.087)      –                 .37
Handwriting                      .17 (.094)      –                 .17
Inference                        –               .10 (.057)        .10
Theory of mind                   –               .12 (.049)        .12
Vocabulary                       –               .19 (.053)        .19
Grammatical knowledge            –               .10 (.046)        .10
Working memory                   –               .43 (.062)        .43

Our  findings  further  highlight  that  the  effects  of  higher-order  cog¬ 
nitive  skills  are  primarily  mediated  by  discourse-level  oral  language 
skill.  In  a  similar  vein,  the  relations  of  foundational  oral  language 
skills  (such  as  vocabulary  and  grammatical  knowledge)  to  writing 
were  completely  mediated  by  discourse-level  oral  language.  Although 
previous studies have shown the relations of vocabulary and grammatical
knowledge to writing after accounting for transcription skills
(Kim et al., 2011, 2014; Olinghouse, 2008), these studies did not
include discourse-level oral language skills.

The  DIEW  model  is  in  line  with  the  simple  view  and  not-so-simple 
view  of  writing,  but  expands  them  in  several  important  ways.  First, 
the  model  explicitly  specified  direct  and  indirect  relations  among 
component  skills  and  their  relations  to  writing.  In  particular, 
discourse-level  oral  language  and  transcription  skills  are  upper-level 
skills  that  subsume  a  complex  array  of  component  skills.  A  large  body 
of  previous  studies  has  shown  component  skills  of  transcription  skills 
(Apel,  Wilson-Fowler,  Brimo,  &  Perrin,  2012;  Bourassa,  Treiman,  & 
Kessler, 2006; Deacon & Bryant, 2005; Kim, 2010; Kim, Apel, & Al
Otaiba, 2013; Nagy, Berninger, Abbott, Vaughan, & Vermeulen,
2003; Roman, Kirby, Parrila, Wade-Woolley, & Deacon, 2009;
Treiman, 1993), and the present study showed component skills of
discourse-level oral language, in line with recent evidence (Kim,
2015, 2016; Lepola et al., 2012; Tompkins et al., 2013). Although the
hypothesis  that  writing  draws  on  two  skills,  discourse-level  oral 
language  and  transcription  skills,  might  appear  to  be  too  simple,  a 
close  look  reveals  a  complex  picture  of  multiple  skills  involved  in 
these  two  upper-level  skills.  Second,  in  the  DIEW  model,  working 
memory  was  explicitly  hypothesized  to  be  a  foundational  cognitive 
capacity  that  supports  other  component  skills.  The  present  study 
indicates  its  essential  role  in  other  component  skills,  and  showed  that 
its  relation  to  writing  is  primarily  mediated  by  other  component  skills. 
Third,  in  the  DIEW  model,  writing  component  skills  are  hypothesized 
to  be  correlated  and  not  orthogonal  (also  see  Hayes,  1996  for  a  similar 
view). For instance, foundational oral language skills such as vocabulary
and grammatical knowledge have been shown to be correlated
(Brimo, Apel, & Fountain, 2015; Conboy & Thal, 2006; Hagtvet,
2003; Kim, 2015, 2016). Similarly, these oral language skills have
also been correlated with transcription skills (Cunningham & Stanovich,
1991; Kim, Apel, et al., 2013; Yeong & Liow, 2011). Therefore,
these skills are dissociable but correlated.

Limitations,  Implications,  and  Conclusion 

The results of the present study should be interpreted with the
current design in mind, including the predictors and the sample examined.
Several  limitations  and  related  future  directions  are  worth  noting. 
First,  it  would  have  been  ideal  to  include  other  known  predictors  of 
writing.  In  particular,  the  not-so-simple  view  of  writing  specifies 
self-regulatory  factors  such  as  attention  and  goal  setting,  and  there¬ 
fore,  future  studies  including  these  factors  would  be  informative. 
Whether  these  skills  form  a  separate  factor  or  their  contributions  to 
writing  are  indirect  via  discourse-level  oral  language  and  transcription 
skills  is  an  open  question.  For  instance,  a  recent  study  suggested  that 
the  relation  of  attentional  control  to  discourse-level  oral  language  is 
primarily  indirect  via  other  component  skills  (e.g.,  vocabulary  and 
grammatical knowledge; Kim, 2016). Second, due to practical constraints,
we were not able to administer multiple measures per construct
and use latent variables, which would have been ideal. Third, the
reliability estimates for working memory (.76) and theory of mind (.79)
did not quite reach the typically desired value of .80.¹

Future  directions  include  replicating  the  present  findings  with  chil¬ 
dren  in  different  developmental  phases  of  writing.  As  children  de¬ 
velop  their  writing  skills,  the  nature  of  relations  and  relative  impor¬ 
tance  of  various  component  skills  might  vary.  For  instance,  the 
relations  of  higher-order  cognitive  skills  to  oral  language  and  to 
writing  might  be  stronger  for  older  children  as  their  cognitive  skills 
are  further  developed  and  writing  tasks  become  more  demanding. 
Moreover,  it  would  be  informative  to  replicate  the  present  study  with 
a  larger  sample  size.  Although  the  sample  size  was  overall  sufficient 
to  detect  patterns  of  relations,  some  nonsignificant  relations  (e.g., 
inference to discourse-level language; see Figure 4) might be partly
due  to  the  sample  size.  Finally,  in  the  present  study,  we  examined  the 
DIEW  model  for  writing  quality  (operationalized  as  idea  develop¬ 
ment).  An  important  way  to  expand  the  DIEW  model  is  to  examine 
the  relations  of  component  skills  to  different  writing  outcomes.  For 
instance, recent studies have shown that writing quality and productivity
are associated but separable dimensions (Kim et al., 2014; Kim,
Al Otaiba, et al., 2015; Puranik, Lombardino, & Altmann, 2008;
Wagner et al., 2011), and the relation of component skills to writing
varies for different writing outcomes (Kim et al., 2014; Kim, Al
Otaiba, et al., 2015).

The present findings offer some preliminary yet important implications.
First, there is a complex array of potential sources of breakdown
in writing development. Therefore, to identify the locus of
writing failure, discourse-level oral language and transcription skills
should be assessed and targeted in instruction; children may be weak
in discourse-level oral language, in transcription skills, or in both.
Importantly,  further  assessments  can  be  conducted  to  find  out  sources 
of  weaknesses  in  discourse-level  oral  language  and/or  transcription 
skills,  and  provide  targeted  instruction  based  on  the  child’s  profiles  of 
strengths  and  weaknesses.  For  transcription  skills,  phonological,  or¬ 
thographic,  and  morphological  awareness  can  be  included  (Apel  et  al., 
2012;  Bourassa  et  al.,  2006;  Deacon  &  Bryant,  2005;  Kim,  2010;  Kim 
et  al.,  2013;  Nagy  et  al.,  2003;  Treiman,  1993).  For  discourse-level 
oral  language  skill,  instruction  and  assessment  should  include  skills 
such as vocabulary, grammatical knowledge, and higher-order cognitive
skills such as making inferences and perspective-taking. Vocabulary
has received much attention as part of oral language assessment
and instruction (e.g., Baumann & Kame’enui, 2004; Beck,
McKeown, & Kucan, 2002; Biemiller & Boote, 2006; Coyne,
McCoach, & Kapp, 2007; Graves, 2006; Silverman & Hartranft, 2015).
Although vocabulary is highly important, the present study, as well as
growing evidence, indicates that more multifaceted, systematic attention
beyond vocabulary would be beneficial for improving
discourse-level oral language.

Together  with  previous  studies,  findings  of  the  present  study 
show  a  complex  array  of  skills  that  contribute  to  writing,  and  thus, 
development  of  writing  is  likely  to  require  development  of  multi¬ 
ple  language  and  cognitive  skills.  Future  longitudinal  studies  are 
warranted. 


1 For Cohen’s kappa values used for writing and discourse-level oral
language skills, .61–.80 is considered substantial and .81–1.00 almost
perfect agreement (Cohen, 1960).
Appendix A

Writing Quality Rubric

Score               Description

0 (not scorable)    • Protocol is blank

1                   • Handwriting is illegible
                    • Student simply rewrites prompt with nothing else
                    • Main idea is not relevant to the prompt or no topic emerges or the idea is difficult to understand.
                    • No details are provided.

2                   • At least one relevant idea is represented and many times, one simple statement captures the topic
                    • The idea is conveyed in a very general way with few details
                    • The writing reads as a list of activities.

3                   • The writing is made up of one or more ideas with a few details.
                    • Flow of ideas is somewhat choppy.
                    • The writing may read as a list of activities and a few places might be repetitive.

4                   • A sense of coherent story is emerging with relatively clear main idea and details, and the writing makes a point.
                    • Reads somewhat like a cohesive story.
                    • The writing is on topic but could be narrower and more focused.

5                   • One clear main idea is developed and the writing reads as a cohesive story in general.
                    • Topic is narrow and focused although could benefit from some additional work.
                    • Supportive details are accurate and developed, and elaborated.
                    • The writer uses relevant and interesting details.
Appendix B

Results  of  Partial  Mediation  Models 


Figure  FA1.  The  relations  of  oral  language  (vocabulary  and  grammatical  knowledge)  to  writing  (a);  higher 
order  cognitive  skills  (inference  and  theory  of  mind)  to  writing  (b);  and  working  memory  to  writing  (c),  after 
accounting  for  discourse-level  oral  language,  spelling,  and  sentence  copying  fluency.  Solid  lines  represent 
statistically  significant  relations  whereas  dashed  lines  represent  nonsignificant  relations.  Gray  lines  represent 
covariances. TNL = test of narrative language; Exp = Expository texts; Oral Lang = oral language; ToM =
theory  of  mind;  Grammar  =  Grammatical  knowledge. 
Models of irregular word reading that take into account both child- and word-level predictors have not been
evaluated in typically developing children and children with reading difficulty (RD). The purpose of the
present study was to model individual differences in irregular word reading ability among 5th grade children
(N = 170), oversampled for children with RD, using item-response crossed random-effects models. We
distinguish between 2 subtypes of children with word reading RD, those with early-emerging and
late-emerging RD, and 2 types of irregular words, “exception” and “strange.” Predictors representing child-level
and word-level characteristics, along with selected interactions between child- and word-characteristics, were
used to predict item-level variance. Individual differences in irregular word reading were predicted at the child
level by nonword decoding, orthographic coding, and vocabulary; at the word level by word frequency and
a spelling-to-pronunciation transparency rating; and by the Reader group × Imageability and Reader group ×
Irregular word type interactions. Results are interpreted within a model of irregular word reading in which
lexical characteristics specific to both child and word influence accuracy.

Keywords:  individual  differences,  irregular  word  reading,  reading  difficulty 


An  essential  development  in  learning  to  read  is  the  acquisition 
of  automatic  word  reading  skills  (defined  in  this  study  as  the 
ability  to  pronounce  written  words  in  isolation)  that  are  impene¬ 
trable  to  factors  such  as  knowledge  and  expectation  (Perfetti,  1992; 
Stanovich,  1991).  Automaticity  of  word  reading  allows  fluent  and 
reliable  retrieval  of  word  representations  from  the  orthographic 
lexicon,  activating  phonological,  syntactic,  morphological,  and 
semantic  information  to  be  used  by  the  reader  to  form  faithful 
representations  of  text  (e.g.,  Kintsch  &  Rawson,  2005;  Perfetti, 
Landi,  &  Oakhill,  2005).  As  children  learn  to  read,  the  ortho¬ 
graphic  lexicon  expands  via  an  increase  in  the  absolute  number  of 
orthographically  addressable  entries,  referred  to  as  “word-specific” 
representations  (Castles  &  Nation,  2006;  Ehri,  2014;  Perfetti  & 
Stafura, 2014). Word-specific representations are considered to be
less dependent on phonological processes because these representations
have been supplanted by specific connections linking spelling
directly to pronunciations (Perfetti, 1992; Share, 1995).

The  addition  of  word-specific  representations  in  developing  readers 
likely  depends  on  both  child-  (Vellutino,  Fletcher,  Snowling,  &  Scan¬ 
lon,  2004)  and  word-level  (Balota  et  al.,  2007;  Seidenberg,  Waters, 
Barnes,  &  Tanenhaus,  1984)  factors.  Children  add  word-specific 
entries  to  the  orthographic  lexicon,  to  a  large  extent,  by  employing  a 
phonologically  based  recoding  mechanism.  Phonological  recoding  is 
the  process  of  translating  a  printed  word  to  speech  by  employing 
grapheme-phoneme correspondence rules. Thus, phonological recoding
functions as a self-teaching mechanism (Jorm & Share, 1983;
Share,  1995;  Share  &  Stanovich,  1995)  that  builds  the  orthographic 
lexicon  through  item-specific  learning  (see  Nation,  Angell,  &  Castles, 
2007; Wang, Nickels, Nation, & Castles, 2013). However, the addition
of word-specific orthographic representations in English is likely
modulated by child-level factors beyond phonological-based recoding
skill (i.e., phonological awareness and nonword decoding skills) that
include, but are not limited to, vocabulary knowledge, orthographic
processing, rapid automatized naming, and print experience (see
Cunningham, Perry, & Stanovich, 2001; Harm & Seidenberg, 2004;
Keenan & Betjemann, 2008; Nation & Snowling, 1998; Plaut,
McClelland, Seidenberg, & Patterson, 1996). In addition, word-level
features  (e.g.,  frequency,  length,  regularity,  and  imageability)  likely 
affect  the  ease  with  which  words  are  added  to  the  orthographic 
lexicon  as  well  (see  Coltheart,  Laxon,  &  Keating,  1988;  Wang, 
Nickels,  Nation,  &  Castles,  2013;  Waters,  Brack,  &  Seidenberg, 
1985;  Waters,  Seidenberg,  &  Brack,  1984). 

Because English orthography is semiopaque (i.e., spelling-to-sound
relationships are not consistent), words are often characterized
as being regular or irregular. This binary distinction between
regular and irregular words is not totally accurate, because English
orthography is best described as quasi-regular (see Plaut, 1999;
Seidenberg, 2005). Nevertheless, it is relevant within a self-teaching
framework describing the development of the orthographic lexicon in
children (Wang, Castles, & Nickels, 2012). From a self-teaching
perspective, irregular words are challenging for developing readers
because they require readers to grapple with the quasi-regular system
governing English orthography (Plaut et al., 1996). As a result, studies
have shown that it is harder for children to acquire orthographic
representations of irregular words than of regular words, and
these orthographic representations may be less precise until they
become fully automatized (see Wang et al., 2012; Wang, Castles,
Nickels, & Nation, 2011; Wang et al., 2013).

The purpose of the present study was to develop a comprehensive
model of irregular word reading skill in developing readers that
simultaneously takes into account both child- and word-level predictors,
emphasizes performance differences between typically developing
(TD) children and children with reading disability (RD), and
focuses on the role of lexical influences on word reading accuracy.
This study was motivated by results from computational models (Plaut
et al., 1996), experimental learning studies (Wang et al., 2011, 2012,
2013), and comparison studies (Waters, Seidenberg, & Brack, 1984)
of irregular word reading suggesting an important interplay between
word regularity, word frequency, child phonological skills (i.e.,
phonemic awareness and nonword decoding), and child semantic
knowledge. For instance, Wang et al. (2013) concluded

when  phonological  decoding  can  be  only  partially  successful,  as  was 
the  case  with  irregular  words  in  this  study,  orthographic  learning  was 
assisted  by  factors  such  as  vocabulary  knowledge.  On  the  other  hand, 
when phonological decoding is not compromised, vocabulary knowledge
does not appear to provide additional assistance to acquiring
orthographic representations. (p. 14)

Plaut  et  al.  further  expand  on  the  relationship  between  child 
phonological  skills,  child  semantic  knowledge,  word  frequency, 
and  word  regularity  by  stating 

the  reading  system  learns  gradually  to  be  sensitive  to  the  statistical 
structure among orthographic, phonological, and semantic representations
and these representations simultaneously constrain each other
in interpreting a given input ... As a result, words with a relatively
weak semantic contribution (e.g., abstract or low-imageability words)
exhibit  a  stronger  frequency  by  consistency  interaction — in  particular, 
naming  latencies  and  error  rates  are  disproportionately  high  for  items 
that  are  weak  on  all  three  dimensions:  abstract  [low-imageability], 
low-frequency, exception words. (pp. 99-101)

Thus,  these  studies  suggest  an  important  interplay  between 
word-level  and  child-level  factors  in  explaining  irregular  word 
reading  variance. 

Our  study  is  unique  in  the  sense  that  it  combines  child-,  word-, 
and  child  by  word  interactions  into  a  single  model  of  irregular 
word  reading.  Very  few  studies  have  combined  both  child-  and 
word-level  predictors  into  a  single  model  predicting  item-level 
variance  in  word  reading  ability.  Those  that  have  include  models 
of  nonword  reading  (Gilbert,  Compton,  &  Kearns,  2011),  morpho¬ 
logically  complex  word  reading  (Goodwin,  Gilbert,  &  Cho,  2013; 
Kearns et al., 2014), and multisyllabic word reading (Kearns, 2015).
A comprehensive model of irregular word reading has the
potential to provide important insights into the relationships between
child and word factors that affect word reading development.
Such a model also affords the opportunity to explore potentially
important child by word interactions that might ultimately
inform studies designed to explore how to improve the word
reading abilities of struggling readers. We proceed by providing a
brief  review  of  the  literature  that  influenced  the  selection  of  word 
and  child  factors  included  in  the  model. 

Word-Level  Predictors 

In  the  present  study  we  attempted  to  include  a  comprehensive 
set  of  word-level  factors  known,  or  suspected,  to  predict  individual 
differences  in  irregular  word  reading  among  developing  readers. 
The predictors included regularity, frequency, length, imageability,
orthographic neighborhood size, and spelling-to-pronunciation
transparency.

Regularity 

A  sizable  literature  exists  examining  the  effects  of  regularity  on 
word  reading  accuracy  in  typically  developing  and  RD  children 
(see  Metsala,  Stanovich,  &  Brown,  1998).  Words  such  as  made 
and  best  are  considered  regular  because  their  pronunciations  are 
predictable  on  the  basis  of  simple  spelling-sound  rules  (Rastle  & 
Coltheart, 1999; Venezky, 1999), and all words with similar rime
patterns (-ade, -est) rhyme. Words such as have and give are
considered irregular because their pronunciations violate simple
spelling-sound correspondences, and they have no rhymes with
similar pronunciations (Glushko, 1979). This class of irregular
words has often been referred to as “exception” words. Waters,
Seidenberg, and colleagues (Seidenberg, Waters, Barnes, &
Tanenhaus, 1984; Waters et al., 1985; Waters et al., 1984; Waters
& Seidenberg, 1985) further subdivided irregular words by including
a strange word class. Strange words (e.g., yacht) have irregular
pronunciations like exception words, but unlike exception words
they also contain spelling patterns that occur in very few or no
other English words (i.e., strange words do not share rime patterns
with other words). Because strange words have deviations in both
orthographic and phonological representation, it has been suggested
that they may be processed differently from exception
words (Seidenberg et al., 1984).
Studies have documented differential effects of word class (i.e.,
regular, exception, and strange) and word frequency as a function of
reading skill in developing readers (Waters, Seidenberg, & Brack,
1984). In general, TD readers made more errors in recognizing exception
words than regular and strange words, whereas children with
RD made more errors in recognizing strange and exception words
compared to regular words (Waters et al., 1985). In the present study
we coded our irregular words as either exception or strange.

Frequency 

Research examining individual differences in word reading
among developing readers reports fairly robust interactions between
word frequency and reading skill. Compared with TD children, RD
children tend to show greater effects of word frequency on word
reading accuracy (see Kuperman & Van Dyke, 2013; Waters et al.,
1984, 1985). Results suggest skilled readers are better able than
struggling readers to identify a large pool of high frequency words
without interference from irregular spelling-sound correspondences
(Waters et al., 1985). We include an estimate of printed
word frequency for our irregular words as estimated by Zeno,
Ivens, Millard, and Duvvuri (1995).

Length 

In  developing  readers  word  length  is  related  to  both  the  speed  and 
accuracy of word reading, particularly in more transparent orthographies
(e.g., Bijeljac-Babic, Millogo, Farioli, & Grainger, 2004; De
Luca, Barca, Burani, & Zoccolotti, 2008). In addition, stronger word
length effects have been reported for children with RD (Martens & de
Jong, 2006; Ziegler, Perry, & Coltheart, 2003; Zoccolotti et al., 2005),
perhaps reflecting impairment in the application of larger orthographic
units in parallel when reading unfamiliar words (Coltheart,
Rastle, Perry, Langdon, & Ziegler, 2001; Hautala, Aro, Eklund,
Lerkkanen, & Lyytinen, 2013). Given that we were interested in
differences  between  TD  and  RD  children  on  irregular  word  reading 
we  included  number  of  letters  as  a  predictor. 

Imageability 

Word  imageability  has  also  been  shown  to  be  a  strong  predictor  of 
word  reading  accuracy  in  developing  readers  (Laing  &  Hulme,  1999). 
Imageability  is  a  word  feature  that  captures  the  ease  with  which  a 
word  can  elicit  a  mental  image  in  the  reader  (Paivio,  Yuille,  & 
Madigan,  1968).  To  date,  some  studies  have  reported  a  main  effect  for 
imageability on word reading accuracy with no interaction for regularity
(Duff & Hulme, 2012; Laing & Hulme, 1999; Monaghan &
Ellis, 2002) whereas others have reported that the effects of imageability
have been isolated to irregular words (e.g., Strain & Herdman,
1999; Strain, Patterson, & Seidenberg, 2002). Others have found that
imageability is particularly important for poor readers (Coltheart,
Laxon, & Keating, 1988; Steacy et al., 2013). Some have suggested
that imageability affects irregular word reading by eliciting an
item-specific mental image associated with the word that helps to facilitate
the formation of word-specific orthographic-to-phonological connections
in irregular words and leads to the establishment of more stable
word representations (see Keenan & Betjemann, 2008). Because of
the conflicting evidence regarding the interaction between imageability
and regularity, we have included it as a predictor in our model of
irregular  word  reading. 


Orthographic  Neighborhood  Size 

The  most  frequently  used  measure  to  investigate  the  influence  of 
similar  orthographic  representations  on  word  reading  performance  is 
sensitivity  to  orthographic  neighborhood  size.  The  orthographic 
neighborhood  size  of  a  given  word  represents  all  of  the  existing  words 
that  can  be  created  by  substituting  one  of  its  letters  for  another  one 
(Coltheart,  Davelaar,  Jonasson,  &  Besner,  1977).  Studies  have  found 
that  orthographic  neighborhood  size  is  related  to  word  reading  and  has 
more of an effect on familiar than unfamiliar words (Laxon,
Gallagher, & Masterson, 2002; Laxon, Masterson, & Moran, 1994).
Laxon, Coltheart, and Keating (1988) reported that poor readers
exhibited larger orthographic neighborhood size effects for words
than good readers. For this reason we include a measure of orthographic
neighborhood size as a word-level predictor.
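
The definition of orthographic neighborhood size given above (the number of existing words formed by substituting exactly one letter) can be sketched in a few lines of Python. This is only an illustrative sketch: the toy lexicon and the function names are assumptions made for the example and are not the word norms or procedures used in the study.

```python
import string

def orthographic_neighbors(word, lexicon):
    """Return the lexicon entries that differ from `word` by one substituted letter."""
    word = word.lower()
    neighbors = set()
    for position, original_letter in enumerate(word):
        for letter in string.ascii_lowercase:
            if letter == original_letter:
                continue  # substituting a letter for itself is not a neighbor
            candidate = word[:position] + letter + word[position + 1:]
            if candidate in lexicon:
                neighbors.add(candidate)
    return neighbors

def neighborhood_size(word, lexicon):
    """Count the one-letter-substitution neighbors of `word` in `lexicon`."""
    return len(orthographic_neighbors(word, lexicon))

# Hypothetical toy lexicon, for illustration only.
TOY_LEXICON = {
    "made", "fade", "wade", "mode", "mace", "male", "make", "mate",
    "best", "rest", "belt", "yacht",
}

for target in ("made", "best", "yacht"):
    print(target, neighborhood_size(target, TOY_LEXICON))
```

In this toy lexicon a regular word such as made has many neighbors, whereas a strange word such as yacht has none, which is consistent with the description of strange words as containing spelling patterns shared with few or no other words.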

Spelling-to-Pronunciation Transparency

It  is  clear  that  irregular  words  vary  significantly  in  terms  of  the 
distance between regularized decoding pronunciation and the actual
phonological representation in the lexicon (known as partial
decoding). For example, the distance from decoding pronunciation
to the phonological representation for irregular words like yacht
and suede appears larger than for words like pint and touch, and this
may affect the ability of developing readers to efficiently add
irregular items to their orthographic lexicons (see Plaut et al.,
1996). Yet, this distance has not been used as a word-level characteristic
to explain irregular word reading variance in developing
readers. In our study, we build off of the set for variability
paradigm (Elbro et al., 2012; Tunmer & Chapman, 2012) by
asking experts to rate the ease of determining the lexical representation
of the irregular words based on faithful application of
decoding rules. The assumption was that the smaller the distance
the simpler it would be to make the match between partial decoding
and the actual lexical representation, and thus the easier it
would  be  to  add  the  word  to  the  orthographic  lexicon.  We  included 
a  spelling-to-pronunciation  transparency  rating  as  a  word-level 
predictor  in  the  model. 
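
The notion of a "distance" between a regularized decoding and the actual pronunciation can be made concrete with a small sketch. Note that the study itself used expert ratings, not a computed metric; the Levenshtein edit distance over phoneme symbols below is a swapped-in proxy chosen purely for illustration, and the phoneme transcriptions are rough, hypothetical approximations rather than data from the study.

```python
def edit_distance(source, target):
    """Levenshtein distance between two sequences of phoneme symbols."""
    previous = list(range(len(target) + 1))
    for i, source_symbol in enumerate(source, start=1):
        current = [i]
        for j, target_symbol in enumerate(target, start=1):
            substitution_cost = 0 if source_symbol == target_symbol else 1
            current.append(min(
                previous[j] + 1,                      # delete a source symbol
                current[j - 1] + 1,                   # insert a target symbol
                previous[j - 1] + substitution_cost,  # match or substitute
            ))
        previous = current
    return previous[-1]

# Hypothetical (regularized decoding, actual pronunciation) pairs; assumptions only.
HYPOTHETICAL_TRANSCRIPTIONS = {
    "pint": (["p", "ɪ", "n", "t"], ["p", "aɪ", "n", "t"]),
    "yacht": (["j", "æ", "tʃ", "t"], ["j", "ɑ", "t"]),
}

def decoding_distance(word):
    """Edit distance from the regularized decoding to the actual pronunciation."""
    regularized, actual = HYPOTHETICAL_TRANSCRIPTIONS[word]
    return edit_distance(regularized, actual)

for word in HYPOTHETICAL_TRANSCRIPTIONS:
    print(word, decoding_distance(word))
```

Under these assumed transcriptions, yacht lies farther from its regularized decoding than pint does, which mirrors the ordering described in the text; an expert rating could capture information that a simple symbol count cannot.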

Child-Level  Predictors 

In  addition  to  word  features,  child  skills  likely  to  play  an 
important  role  in  predicting  irregular  word  reading  accuracy  were 
included  in  the  model.  We  include  a  comprehensive  set  of  child- 
level  factors  known  to  predict  individual  differences  in  the  word 
reading  of  developing  readers.  The  predictors  included  reader 
group,  phonological  awareness,  nonword  decoding,  vocabulary, 
orthographic  coding,  rapid  automatized  naming,  and  a  measure  of 
print  exposure. 

Reader  Group 

There  is  a  growing  body  of  literature  to  suggest  that  some  students 
who  exhibit  typical  reading  skill  growth  in  the  early  years  develop 
difficulty with word reading later in elementary school as the demands
of reading increase (Catts, Compton, Tomblin, & Bridges,
2012; Leach, Scarborough, & Rescorla, 2003; Lipka, Lesaux, &
Siegel, 2006). One potential contributor to these unexpected reading
difficulties may be the increasing demands placed on the reader at the
word level, as students are expected to read more complex text and
they are faced with new words, many of which are lower in frequency
and contain irregular spelling patterns and thus require the formation
of complex connections between phonology, orthography, and semantics
(Catts et al., 2012; Leach et al., 2003). In the present study,
we were particularly interested in whether differences exist across
reader groups representing TD, late-emerging RD, and early emerging
RD on irregular word reading. Late-emerging RD refers to students
who have been identified with reading difficulties later in
elementary school, while early emerging RD refers to students who
have been identified in first grade.

Phonological  Awareness 

Both cross-sectional and longitudinal studies indicate that phonological
awareness (PA) has a high correlation with and accounts for
unique variance in word reading (Kirby, Parrila, & Pfeiffer, 2003;
Melby-Lervåg, Lyster, & Hulme, 2012). Deficits in phonological
processing skills have been causally linked to poor word-identification
skills through a mechanism that disrupts the development
of decoding skills (Brady & Shankweiler, 1991; Bruck, 1992;
Siegel & Faux, 1989; Stanovich & Siegel, 1994; Torgesen, 2000;
Vellutino et al., 1996). Deficits in the ability to recognize and manipulate
the phonemes of words are believed to disrupt the acquisition of
spelling-to-sound translation routines that form the basis of early
decoding-skill development (Bus & van IJzendoorn, 1999; van IJzendoorn
& Bus, 1994; Rack, Snowling, & Olson, 1992). For this reason,
we  include  PA  as  a  child-level  predictor  in  the  model. 

Nonword  Decoding 

Nonword  reading  is  considered  a  proxy  of  decoding  skill.  Even 
though irregular words are not decodable, there is a close relationship
between decoding skill and irregular word reading, with the
association being stronger in TD compared to RD children (Griffiths
& Snowling, 2002). Griffiths and Snowling (2002) speculated
that,  “even  the  ability  to  read  words  that  do  not  conform  to  regular 
grapheme-phoneme  correspondences  depends  on  having  access  to 
segmental  phonological  representations”  (pp.  40-41).  Thus,  we 
incorporated  a  measure  of  nonword  decoding  in  the  irregular  word 
reading  model. 

Vocabulary 

Keenan and Betjemann (2008) have speculated that semantic activation
may help to “fill voids” in phonological-orthographic processing
in individuals with poor mappings, such as children with RD (p.
193). A growing literature implicates the role of lexical knowledge
(e.g.,  semantic  or  lexical  phonology)  in  the  learning  of  irregular  words 
by  developing  readers  (see  Harm  &  Seidenberg,  2004;  Plaut  et  al., 
1996;  Ricketts,  Nation,  &  Bishop,  2007).  For  this  reason  we  included 
a  measure  of  vocabulary  ability  in  the  model. 

Orthographic  Coding 

Evidence  suggests  that  orthographic  coding  measures  a  skill 
distinct  from  phonological  decoding,  print  exposure,  and  other 
reading related skills (e.g., Cunningham & Stanovich, 1991;
Hagiliassis, Pratt, & Johnston, 2006). Orthographic coding measures
have been shown to be a unique predictor of children’s word
reading after controlling for PA and rapid automatized naming
(Cunningham, Perry, & Stanovich, 2001). Thus we include
orthographic coding as a child-level measure.

Rapid  Automatized  Naming 

Rapid  automatized  naming  (RAN)  of  familiar  stimuli  such  as 
objects,  colors,  digits,  or  letters  repeatedly  accounts  for  unique 
variance  in  both  concurrent  and  future  reading  and  spelling 
achievement (for a review see Kirby et al., 2010). Research indicates
that these relationships hold even after controlling for socioeconomic
status, IQ, and PA (e.g., Kirby et al., 2010; Lervåg &
Hulme,  2009)  and  that  students  with  good  and  poor  reading  skills 
differ  in  their  performance  on  RAN  tasks  (Bowers,  1995).  We 
therefore  include  RAN  as  a  child-level  predictor. 

Print  Exposure 

A  number  of  researchers  have  reported  evidence  in  support  of  an 
association  between  reading  experience  (indexed  by  measures  of 
print  exposure)  and  reading  skill  (Cunningham  &  Stanovich,  1991; 
McBride-Chang,  Manis,  Seidenberg,  Custodio,  &  Doi,  1993;  cf. 
Barker,  Torgesen,  &  Wagner,  1992).  Griffiths  and  Snowling 
(2002)  reported  that  in  developing  readers  print  exposure  was 
significantly associated with irregular word reading but not nonword
decoding. A measure of print exposure was included as a
child-level  predictor. 

Research  Questions 

In the present study we ask three related research questions regarding
irregular word reading in 5th grade students. The first explores
whether differences exist in irregular word reading between classes of
children identified as late-emerging RD and those identified as early
emerging RD and TD. We predict an order effect in which the TD
group outperforms both RD groups and the late-emerging RD group
outperforms the early emerging RD group, suggesting that the early
emerging RD group has more severe word reading difficulties. The
second explores the relative role of child- and word-level characteristics
as predictors of item-level variance on irregular word reading
accuracy, with a particular interest in the role of child- and word-level
lexical influence. We predict that both child- and word-level features
associated with lexical processing will make unique and significant
contributions to item-level irregular word reading performance. At the
child level we hypothesize that vocabulary will uniquely predict
irregular word reading performance after controlling for nonword
decoding, phonemic awareness, rapid automatized naming, print exposure,
and orthographic processing. At the word level we hypothesize
that imageability and a rating of spelling-to-pronunciation ease
will uniquely predict irregular word reading performance after controlling
for word frequency, word length, orthographic neighborhood
size, and orthographic distinctiveness (i.e., exception vs. strange
word). The final research question is exploratory in nature, investigating
the importance of child-level by word-level interactions in
explaining irregular word reading variance, with a specific focus on
interactions between RD status (i.e., late-emerging RD, early emerging
RD, and TD) and word-level characteristics (i.e., frequency,
imageability, spelling-to-pronunciation transparency rating, and
whether an irregular word is strange). Although these analyses are
exploratory, we were particularly interested in whether differences
existed between RD groups in terms of the relative importance of
word- and child-level variables associated with lexical processing
(i.e., imageability, spelling-to-pronunciation transparency rating, and
vocabulary knowledge) in predicting item-level irregular word reading.

This  study  extends  the  literature  in  several  important  ways.  Our 
study  is  the  first  to  employ  item-response  crossed  random-effects 
models (Bates, Maechler, & Bolker, 2013) to explain variability in
children’s irregular word reading at the item level using a comprehensive
set of predictors. In doing so we are able to model the unique
contributions of child, word, and Child × Word interactions simultaneously.
In addition, we include two word-level measures, imageability
and a spelling-to-pronunciation transparency rating, which
have not previously been incorporated into models of irregular word
reading. Finally, this study adds to the literature by distinguishing
between three important reader classes based on word reading skill
development over time: those who have early emerging RD,
late-emerging RD, and TD. In proposing a comprehensive model of
irregular word reading we were particularly interested in the relative
importance of various lexical processing measures at the word-level
(imageability and spelling-to-pronunciation transparency ratings) and
child-level (e.g., vocabulary knowledge) across RD groups as predictors
of item-level irregular word reading. This allows us to interpret
our item-level analyses within the broader literature examining irregular
word reading in TD and RD children, as well as search for
“malleable  factors”  that  might  be  exploited  to  improve  the  irregular 
word  reading  abilities  of  struggling  readers. 
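
For readers unfamiliar with item-response crossed random-effects models, one common way to write such a model for a binary (correct/incorrect) response is sketched below. The notation is an assumption introduced here for illustration (child p, item i, generic predictor names, and normally distributed random intercepts); it is not taken from the study and may differ from the authors' exact specification.

```latex
\begin{align}
\operatorname{logit}\,\Pr(y_{pi} = 1)
  &= \gamma_{0}
   + \sum_{k} \beta_{k}\,\mathrm{child}_{kp}
   + \sum_{m} \lambda_{m}\,\mathrm{word}_{mi}
   + \sum_{k,m} \delta_{km}\,\mathrm{child}_{kp}\,\mathrm{word}_{mi}
   + u_{p} + w_{i}, \\
u_{p} &\sim N\!\left(0, \sigma_{u}^{2}\right), \qquad
w_{i} \sim N\!\left(0, \sigma_{w}^{2}\right).
\end{align}
```

Here $y_{pi}$ is 1 when child $p$ reads item $i$ correctly, the $\beta_{k}$ and $\lambda_{m}$ terms carry child-level and word-level predictors, the $\delta_{km}$ terms carry Child × Word interactions, and the random intercepts $u_{p}$ and $w_{i}$ are crossed because every child attempts every word.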

Method 

Participants 

Participants  were  drawn  from  a  multiyear  cohort  longitudinal 
study  examining  response-to-intervention  decision  rules  (see 
Compton  et  al.,  2010)  and  prevention  efficacy  (see  Gilbert  et  al., 
2013)  in  first  grade  children.  The  subjects  in  this  sample  are 
identical  to  those  reported  in  Kearns  (2015)  and  thus  the  sampling 
procedures  are  the  same.  For  this  study  children  were  assessed  at 
the end of first through fourth grades on measures of word identification
and comprehension. Latent transition analyses (LTA) were used
to assign each child to either RD (late-emerging RD or early
emerging RD) or TD classes as a function of time. In this study
late-emerging RD membership was defined as a child who transitioned
from an initial classification of TD to RD over time; early
emerging RD as a child who was assigned to the RD class at the
end of first grade and remained in the class across time; and TD as
a child who was assigned to the TD class at the end of first grade
and remained in that class over time. (A very small number of
children in the sample transitioned from RD to TD over time but
this group was not included in this study.) In the case of word
reading (-W), LTA allowed the identification of classes representing
TD, early emerging RD-W, and late-emerging RD-W, and for
reading comprehension (-C), classes representing TD, early emerging
RD-C, and late-emerging RD-C. Results from the two LTA
models were combined to further identify early emerging RD-CW
and late-emerging RD-CW classes. Thus, seven latent classes were
identified: TD, early emerging RD-W, -C, -CW, late-emerging
RD-W, -C, and -CW. Counts for the various reading classes
identified through LTA in fourth grade as a function of cohort are
displayed in Table 1. We then selected from the larger fourth grade
sample a subsample of children to be assessed in the fall of fifth
grade. These target children consented to three 1-hr testing sessions
measuring reading, language, knowledge, executive function,
and attention. Our sampling strategy attempted to consent all
early emerging RD and late-emerging RD children in a given
cohort and then to randomly select TD children to assess. Table 1
provides the number of children who were consented and administered
the fifth-grade battery. Since this study specifically targeted
word-reading skills we only selected early emerging RD and
late-emerging RD classes in which word reading difficulties were
present: early emerging RD-W (n = 1), early emerging RD-CW
(n = 18), late-emerging RD-W (n = 15), and late-emerging
RD-CW (n = 30) along with TD children (n = 109). The
word-only and the mixed word and comprehension difficulty groups
were combined for the early emerging RD (n = 19) and
late-emerging RD (n = 45) groups. Three children had missing data on
some measures and were removed from the analyses. The final
sample included 170 students in the following classes: TD (n =
108), late-emerging RD (n = 43), and early emerging RD (n =
19). More detail regarding the Latent Transition Analyses and a
description  of  the  specific  sampling  plan  for  each  of  the  cohorts  is 
provided  in  Appendix  A. 
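
The class definitions and sample counts in this section can be restated as a short sketch. The function below is a deliberate simplification that works on already-assigned TD/RD labels per grade; it is not the latent transition analysis itself, and the function and variable names are hypothetical. The counts are copied from the text above.

```python
def classify_trajectory(labels):
    """Map a child's ordered TD/RD class labels (end of Grade 1 onward) to a reader group.

    Returns None for patterns the definitions in the text do not cover or that were
    excluded from the study (e.g., children who moved from RD to TD).
    """
    if all(label == "TD" for label in labels):
        return "TD"
    if all(label == "RD" for label in labels):
        return "early emerging RD"
    if labels[0] == "TD" and labels[-1] == "RD":
        return "late-emerging RD"
    return None

# Fifth-grade counts reported in the text for classes with word reading difficulty.
CONSENTED = {
    "early emerging RD": 1 + 18,   # RD-W + RD-CW
    "late-emerging RD": 15 + 30,   # RD-W + RD-CW
    "TD": 109,
}
# Final analysis sample after removing three children with missing data.
FINAL_SAMPLE = {"TD": 108, "late-emerging RD": 43, "early emerging RD": 19}

print(classify_trajectory(["TD", "TD", "RD", "RD"]))
print(sum(CONSENTED.values()), sum(FINAL_SAMPLE.values()))
```

The two totals differ by three, matching the three children the text reports as removed for missing data.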


Table 1

Total Sample of 4th Grade Students, 5th Grade Students Sampled, and Counts of the Various
Reading Classes Derived From Latent Transitional Analysis as a Function of Cohort

Reading class   Cohort 1              Cohort 2              Cohort 3              Total sample
                4th grade/5th grade   4th grade/5th grade   4th grade/5th grade   4th grade/5th grade
TD              172/38                64/30                 86/41                 322/109
ERD-W           1/1                   1/0                   1/0                   3/1
ERD-C           28/11                 23/10                 13/5                  64/26
ERD-CW          15/8                  10/4                  10/6                  35/18
LERD-W          10/6                  6/3                   8/6                   24/15
LERD-C          36/27                 35/12                 30/17                 101/56
LERD-CW         23/16                 18/4                  16/10                 57/30
Total           285/107               157/63                164/85                606/235

Note.  All  children  who  did  not  move  stayed  eligible  for  the  study.  ERD  =  Early  identified  reading  difficulty; 
LERD  =  Late  emerging  reading  difficulty;  TD  =  Typically  developing.  Bolded  5th  grade  numbers  represent  the 
sample  used  in  the  present  study. 
Measures

LTA  measures  (Grades  1-4). 

Word  identification.  Word  identification  was  measured  with  the 
Word  Identification  subtest  from  the  Woodcock  Reading  Mastery 
Tests-Revised/Normative  Update  (Woodcock,  1998).  For  this  task, 
children  were  asked  to  read  words  aloud  one  at  a  time.  The  test  was 
not  timed  but  children  were  encouraged  to  move  to  the  next  item  after 
a  5-s  silence.  Correct  pronunciations  were  counted  as  correct  and  the 
total  score  was  the  sum  of  correct  items.  Basal  and  ceiling  rules  were 
applied.  The  examiner’s  manual  reports  the  split  half  reliability  for 
fifth  grade  students  as  .91  (Woodcock,  1998). 

Passage  comprehension.  General  reading  comprehension  was 
measured  with  the  Passage  Comprehension  subtest  from  the 
Woodcock  Reading  Mastery  Tests-Revised/Normative  Update 
(Woodcock, 1998). For this test, children were asked to silently read
one- to two-sentence prompts from which a single word had been removed.
Children  were  asked  to  provide  the  omitted  word.  Basal  and 
ceiling  rules  were  applied.  Split-half  reliability,  provided  by  the 
Technical  Manual,  is  .91  for  9-year  olds  and  .89  for  10-year  olds 
(Woodcock,  1998). 

Child  measures. 

Irregular word reading. Irregular word reading was the dependent
variable of interest. Irregular word reading ability was
assessed  using  an  experimental  word  list  developed  by  Adams  and 
Huggins  (1985).  The  list  contained  50  words  with  irregular 
spelling-to-sound  correspondences  that  varied  in  word  frequency. 
According  to  the  authors, 

the  frequencies  of  the  words  ranged  from  134.1  per  million  to  .12  per 
million  according  to  Carroll,  Davies,  and  Richman’s  (1971) 
dispersion-adjusted  norms  (U-scale).  The  50  words  were  selected 
from  a  set  of  80  words,  used  in  pilot  testing  with  80  children,  so  as  to 
exclude  words  that  were  of  inordinate  ease  or  difficulty  given  their 
frequency and words that appeared to be beyond the children’s listening
or speaking vocabularies.

Words  were  arranged  to  increase  in  difficulty  and  students 
attempted  to  read  all  words.  The  list  contained  words  that  prove 
very  difficult  to  recode  phonologically  because  of  the  complex 
orthographic  patterns  they  comprise  (e.g.,  ocean,  heights,  tongue, 
guitar,  recipe).  The  list  contained  both  exception  and  strange 
words  based  on  Seidenberg  et  al.  (1984).  The  list  was  originally 
developed  for  second  through  fifth  graders.  One  item  was  dropped 
from  the  analyses  due  to  an  administration  error.  Scores  ranged 
from  0  to  49  for  this  sample.  The  internal-consistency  reliability 
for  this  task  was  .94. 

Reader group. Children were classified into one of three reading
groups based on the LTA analyses (see above): early emerging
RD (early emerging RD-W & early emerging RD-CW), late-emerging
RD (late-emerging RD-W & late-emerging RD-CW),
and TD children. To contrast group performance, two dummy
codes were created: one comparing the early emerging RD group to the
late-emerging RD group (designated as early emerging RD) and
one comparing the TD group to the late-emerging RD group (designated
as TD).
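The dummy coding described for the reader groups can be illustrated with a short sketch. The function name and dictionary keys are hypothetical; only the coding scheme (late-emerging RD as the referent, one code per contrast) comes from the text, and the group labels follow the abbreviations used in Table 2.

```python
# Reference-group (dummy) coding with late-emerging RD (LERD) as the referent.
# Function and key names are illustrative assumptions, not the authors' code.

def dummy_codes(group: str) -> dict:
    """Return the two dummy codes for one child's reader group."""
    if group not in {"ERD", "LERD", "TD"}:
        raise ValueError(f"unknown reader group: {group}")
    return {
        "early_emerging_RD": 1 if group == "ERD" else 0,  # ERD vs. LERD contrast
        "TD": 1 if group == "TD" else 0,                  # TD vs. LERD contrast
    }

for g in ("LERD", "ERD", "TD"):
    print(g, dummy_codes(g))
```

Under this scheme a child in the referent group receives 0 on both codes, so each coefficient is read as a difference from the late-emerging RD group.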

Orthographic  choice.  The  orthographic  choice  task  in  this 
study  was  a  shortened  version  of  the  one  used  by  Olson,  Kliegl, 
Davidson,  and  Foltz  (1985).  Whereas  the  original  test  had  80 
items, the current version had only 40 (only the odd-numbered
items from the original test were retained). Children were presented
two sheets of paper, each containing two columns of test
items. They were asked to circle the real word in each pair. Each
item comprised the correctly spelled word and a pseudohomophone
foil (e.g., rain and rane). Children completed and received
feedback  on  4  practice  items  prior  to  beginning  the  test.  The  total 
score  was  the  sum  of  the  correct  items.  The  minimum  score  in  this 
sample  was  21  and  the  maximum  score  was  40.  Coefficient  alpha 
for  our  sample  was  .76. 

Phonological awareness. Phonological awareness was measured
with the Elision subtest of the Comprehensive Test of
Phonological  Processing  (CTOPP;  Wagner,  Torgesen,  &  Rashotte, 
1999).  In  this  test,  children  were  presented  a  word,  asked  to  repeat 
the  word,  and  then  asked  to  say  the  word  without  a  specified 
syllable  for  the  first  3  items  and  without  a  specified  phoneme  for 
the remaining 17 items. Items were ordered by increasing difficulty,
and the examiner discontinued administration after three consecutive
incorrect items. In addition to the 20 test items (for 5 of
which  examiners  provided  performance  feedback),  6  practice 
items  were  administered.  The  total  score  was  the  sum  of  correct 
items.  Scores  ranged  from  1  to  20  in  this  sample.  Coefficient  alpha 
provided  by  the  manual  for  age  10  was  .91  and  age  11  was  .86. 

Nonword  decoding.  The  Woodcock  Reading  Mastery  Test- 
Revised — Normative  Update:  Word  Attack  (Woodcock,  1998),  a 
norm  referenced  test,  evaluates  students’  ability  to  pronounce 
pseudowords  presented  in  list  form.  It  contains  45  nonsense  words 
ordered  from  easiest  to  most  difficult.  Students  were  asked  to  read 
(decode)  the  words  aloud,  one  at  a  time.  The  developer- 
recommended  basal  and  ceiling  rules  were  applied  to  minimize 
boredom  and  frustration.  Scores  ranged  from  0-39  in  this  sample. 
Split-half  reliability  exceeded  .90  for  the  sample. 

Rapid automatized naming. Rapid automatized naming was assessed
using the Rapid Letter Naming subtest of the CTOPP
(Wagner,  Torgesen,  &  Rashotte,  1999).  Two  versions  of  the  test 
were  given.  On  both,  six  letters  were  randomly  printed  in  four 
rows  of  nine  letters.  After  ensuring  each  child  could  identify  the 
letters,  the  child  was  asked  to  name  the  letters  as  fast  as  possible. 
The  total  score  was  the  number  of  seconds  it  took  the  child  to 
name  the  letters  on  both  tests.  Scores  ranged  from  21-80  in  this 
sample.  Test-retest  reliability  was  .72  for  children  of  ages  8-17 
years  per  the  test  manual. 

Vocabulary.  Receptive  vocabulary  was  measured  using  the 
Peabody  Picture  Vocabulary  Test  (PPVT;  Dunn  &  Dunn,  2007). 
Students  were  asked  to  select  one  of  four  pictures  when  presented 
with  a  spoken  word.  Scores  ranged  from  103  to  195  for  this 
sample. The alternate-forms reliability coefficient, Cronbach’s alpha,
and split-half reliabilities for 10-year-olds are .87, .96, and .93,
respectively, as per the manual.

Book  Title  Questionnaire.  Print  exposure  was  measured  with 
the Book Title Questionnaire (Beall, 2011). This self-report questionnaire
was adapted from Cunningham and Stanovich (1991) in
two  ways:  books  that  were  made  into  movies  were  removed  and 
book  titles  were  updated  to  include  more  recent  popular  books. 
Fifty  real  book  titles  along  with  14  foil  titles  were  presented. 
Children  were  asked  simply  to  check  the  titles  that  they  knew  were 
real  books.  They  were  informed  that  some  of  the  titles  were  not 
real,  and  therefore  encouraged  not  to  guess  but  only  to  mark  the 
titles  they  knew  were  real  books.  The  total  score  was  the  number 
of real titles checked minus the number of foils checked. Coefficient
alpha for our sample was .94 based on all items
(real and foil).
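The scoring rule for the Book Title Questionnaire (real titles checked minus foils checked) can be sketched as follows. The function name and the placeholder titles are hypothetical; the actual questionnaire items are not listed in the text.

```python
# Book Title Questionnaire scoring: hits on real titles minus checked foils.
# Titles below are placeholders, not items from the actual questionnaire.

def btq_score(checked, real_titles, foil_titles):
    """Return (# real titles checked) - (# foil titles checked)."""
    checked = set(checked)
    hits = len(checked & set(real_titles))
    false_alarms = len(checked & set(foil_titles))
    return hits - false_alarms

real = {"Real Title A", "Real Title B", "Real Title C"}
foils = {"Foil Title X", "Foil Title Y"}
print(btq_score({"Real Title A", "Real Title C", "Foil Title X"}, real, foils))
```

Subtracting checked foils penalizes indiscriminate checking, which is consistent with the instruction that children should mark only titles they knew were real.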

Word  measures. 

Frequency. The metric used for word frequency was the standard
frequency index (SFI) from the Educator’s Word Frequency
Guide (Zeno, Ivens, Millard, & Duvvuri, 1995). SFI represents a
logarithmic  transformation  of  the  frequency  of  word  type  per 
million  tokens  within  a  corpus  of  more  than  60,000  samples  of 
texts  from  various  sources.  These  sources  range  from  textbooks  to 
popular  literature.  The  range  of  SFI  within  the  corpus  is  3.5  to 
88.3.  Words  in  our  sample  ranged  from  33  to  61.60. 
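To make the logarithmic scale concrete, the sketch below converts between frequency per million tokens and SFI. The text states only that SFI is a logarithmic transformation; the specific formula used here, SFI = 10(log10 U + 4) with U the frequency per million, is the commonly cited definition and is an assumption on our part, as are the function names.

```python
import math

# Assumed definition: SFI = 10 * (log10(U) + 4), U = frequency per million tokens.

def sfi_from_per_million(u: float) -> float:
    """Convert a per-million frequency to the standard frequency index."""
    return 10.0 * (math.log10(u) + 4.0)

def per_million_from_sfi(sfi: float) -> float:
    """Invert the transformation: SFI back to frequency per million."""
    return 10.0 ** (sfi / 10.0 - 4.0)

print(sfi_from_per_million(1.0))      # a word occurring once per million tokens
print(per_million_from_sfi(50.0))     # frequency implied by an SFI of 50
```

Under this assumed definition, each 10-point increase in SFI corresponds to a tenfold increase in frequency.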

Imageability. Imageability is a word-specific feature referring
to the ease with which a word can elicit a mental image in the
reader (Paivio, Yuille, & Madigan, 1968). Existing tables of imageability
ratings were missing many of the words on the Adams
and Huggins (1985) list, and therefore we collected our own data.
These  data  were  collected  from  68  undergraduate  students  in  an 
introductory  special  education  class.  They  were  asked  to  rate  the 
difficulty  of  bringing  about  a  mental  image  for  the  49  words  on  the 
irregular  word  reading  list.  The  instructions  included  the  following 
prompt  from  the  original  paper  by  Paivio,  Yuille,  and  Madigan 
(1968): 

The  words  that  arouse  mental  images  most  readily  for  you  should  be 
given  a  rating  of  7;  words  that  arouse  images  with  the  greatest 
difficulty or not at all should be rated 1; words that are intermediate
in ease or difficulty of imagery, of course, should be rated appropriately
between the two extremes.

Average ratings for these words ranged from 1.75 to 6.87. Cronbach’s
alpha for imageability for this sample was .93.
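Cronbach's alpha, the reliability index reported for the ratings, can be computed from a words-by-raters matrix as in the sketch below. This is the standard textbook formula, not the authors' code; the function name and the toy ratings are hypothetical.

```python
from statistics import variance

def cronbach_alpha(ratings):
    """Cronbach's alpha for a matrix with one row per rated word and one
    column per rater (or item): k/(k-1) * (1 - sum(col variances)/var(row sums))."""
    k = len(ratings[0])
    columns = list(zip(*ratings))
    sum_item_vars = sum(variance(col) for col in columns)
    total_var = variance([sum(row) for row in ratings])
    return (k / (k - 1)) * (1.0 - sum_item_vars / total_var)

# Toy data: three words rated by two hypothetical raters.
toy = [[1, 2], [2, 1], [3, 3]]
print(round(cronbach_alpha(toy), 3))
```

Alpha approaches 1 as raters order the words more consistently, which is the sense in which the reported values index agreement across raters.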

Orthographic neighborhood size. To account for the orthographic
similarity of the target words to other words, we used
orthographic neighborhood size measured by Coltheart’s N (ON;
Coltheart, Davelaar, Jonasson, & Besner, 1977). This metric is the
number of words of the same length that can be produced by changing
a single letter of the target word. These data were obtained from the
English Lexicon Project (Balota et al., 2007).
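The definition of Coltheart's N can be illustrated with a small sketch that counts single-letter substitution neighbors within a word list. The study obtained its values from the English Lexicon Project; the function name and the toy lexicon here are hypothetical and serve only to show the counting rule.

```python
import string

def coltheart_n(word, lexicon):
    """Count words in `lexicon` that differ from `word` by exactly one
    substituted letter (same length, same positions otherwise)."""
    word = word.lower()
    lexicon = {w.lower() for w in lexicon}
    neighbors = set()
    for i in range(len(word)):
        for letter in string.ascii_lowercase:
            if letter == word[i]:
                continue  # substitution must change the letter
            candidate = word[:i] + letter + word[i + 1:]
            if candidate in lexicon:
                neighbors.add(candidate)
    return len(neighbors)

# Toy lexicon for illustration only.
toy_lexicon = {"rain", "gain", "main", "rail", "raid", "ruin", "train", "ran"}
print(coltheart_n("rain", toy_lexicon))
```

In the toy lexicon, "train" and "ran" are not counted because they differ in length, so only substitutions qualify as neighbors.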

Spelling-to-pronunciation  transparency  rating.  To  address 
how  easy  it  was  for  students  to  arrive  at  the  correct  pronunciation 
of  each  irregular  word  by  applying  typical  decoding  rules,  we 
asked  expert  raters  to  rate  this  difficulty  on  a  6-point  scale.  Our 
expert  raters  were  professors  and  graduate  students  with  a  firm 
background  in  phonics  and  decoding.  We  collected  these  data  from 
22  expert  raters.  Experts  were  asked  to  rate  words  on  a  6-point 
scale  and  they  were  given  the  following  prompt: 

Below  you  will  find  a  list  of  irregular  words.  We  would  like  you  to 
pretend  that  the  letter  string  is  unfamiliar  to  you  and  apply  a  decoding 
strategy  to  the  letter  string  and  rate  the  ease  of  matching  your  recoded 
form  of  the  letter  string  to  the  actual  word  pronunciation.  Rate  the 
difficulty of making the match between recoded form and pronunciation
on a scale from 1 to 6, with 1 being very easy and 6 being very
difficult.

Cronbach’s  alpha  for  the  spelling-to-pronunciation  transparency 
rating  for  this  sample  was  .84. 

Strange  versus  exception  words.  We  assessed  whether  the 
words  included  on  the  irregular  word  list  were  strange  or  exception 
words according to the criteria outlined by Seidenberg, Waters,
Barnes, and Tanenhaus (1984). Words were coded as ‘1’ if they
were strange and as ‘0’ if they were exception. For multisyllabic
words, the Seidenberg et al. criteria were applied to each syllable
separately. The authors of this paper coded the words for strange
versus  exception.  There  were  20  strange  words  on  the  list  and  29 
exception  words. 

Word  length.  To  account  for  differences  in  word  length  across 
items  the  number  of  letters  in  each  irregular  word  was  used  as  a 
word-level  covariate. 

Procedure 

Test  examiners  were  graduate  research  assistants  who  had  been 
trained  on  tests  until  procedures  were  implemented  with  90% 
fidelity. Most students were tested in three 1-hr sessions, although
a minority were tested in two 1.5-hr sessions or in one 3-hr session.
All tests were given individually, audio recorded for reliability/
fidelity purposes, and scored by the original examiner. Children
received small school-related prizes or a $5 gift card for participating
in each testing session. All tests were double-scored and
double-entered; discrepancies were resolved by a third examiner.
Average fidelity of test administration procedures (based on a
random selection of 20% of the taped assessment sessions) exceeded
94% for all tests. Study data were entered and managed
using  REDCap  electronic  data  capture  tools  hosted  at  Vanderbilt 
University  (Harris  et  al.,  2009). 

Analyses 

A series of crossed-random effects models (De Boeck, 2008;
Van den Noortgate, De Boeck, & Meulders, 2003) were used to
answer the research questions outlined above because these models
allowed us to include both child- and word-level predictors in
the same model as well as address interactions between the two.
These item response theory based models are cross-classification
multilevel models that allow variance to be partitioned across the
person and item level and allow for responses to be predicted by
both person and item level effects. We conducted these analyses
using Laplace approximation available through the lmer function
(Bates & Maechler, 2009) from the lme4 library in R (R Development
Core Team, 2012). The analyses included 170 children and
49 words. For these models, words and persons are assumed to be
random samples from a population of words and a population of
persons. Because words are not nested within persons, these models
are not strictly hierarchical models, but instead cross-classified.
Words and persons are on the same level and crossed in the design
and responses are nested within persons and within words (see
Appendix A, Figure 1A). Power for these analyses has been
addressed through simulation studies (see Cho, Partchev, & De
Boeck, 2012). Various methods for examining model parameters
indicate little difference in fixed effect estimates across methods
with precision being relatively robust to sample size and number of
items. For random effects, although some methods (e.g., the alternating
imputation posterior method) may present larger bias when
the models are used with smaller samples, these same models also
tend to result in smaller mean standard errors (Cho et al., 2012). To
address power within our own sample we conducted a simulation
study to determine the minimal detectable effect size defining
power at .80 ($\alpha = .05$). Because crossed-random effects models do
not yield traditional effect size estimates, our simulation estimated
the minimal $R^2$ change detectable when a covariate was added to
the model to predict either child or word variance and then this
minimal variance change was converted into an $f^2$ statistic which
is interpretable using guidelines provided by Cohen (1988). Using
this method, our sample of words and children allows us to detect
a minimal variance change on words equivalent to Δ8.26% and on
children equivalent to Δ3.27%. These reductions in variance correspond
to $f^2$ statistics of .09 for words and .03 for children,
representing small effects. Therefore, our models, with a sample
size of 170 children and 49 items (totaling 8,330 observations), are
powered to detect small effects based on Cohen’s criteria for
multiple $R^2$ (Cohen, 1988).

Figure 1. Interaction between reader type (TD = typically developing;
LERD = Late Emerging Reading Difficulty) and word imageability.
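The conversion from a variance change to Cohen's effect size can be checked with a few lines of arithmetic. The sketch assumes the standard definition $f^2 = R^2 / (1 - R^2)$ applied directly to the minimal detectable variance change; the text does not spell out the exact conversion the authors used, and the function name is hypothetical.

```python
# Assumed conversion: f^2 = R^2 / (1 - R^2), applied to the minimal
# detectable variance change reported in the text.

def cohens_f2(r2_change: float) -> float:
    """Convert a proportion of variance (0 <= r2_change < 1) to Cohen's f^2."""
    if not 0.0 <= r2_change < 1.0:
        raise ValueError("r2_change must be in [0, 1)")
    return r2_change / (1.0 - r2_change)

print(round(cohens_f2(0.0826), 2))  # word-level variance change of 8.26%
print(round(cohens_f2(0.0327), 2))  # child-level variance change of 3.27%
```

Under this assumption the two variance changes round to the .09 and .03 values reported for words and children.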

The crossed-random effects models (see Models 0-3 in Table 7)
were built gradually in a stepwise fashion using model comparisons
to determine the model that best fit the data. The unconditional
model (Model 0) was fit first by adding a person-specific
random effect ($r_{010j}$) and an item-specific random effect ($r_{020i}$)
because we expected random variation related to each of these
variables. The binary outcome ($p_{ji}$; the probability of a correct
response from person j on item i) was assumed to follow the
Bernoulli distribution and random effects were assumed to be
normally distributed. We used an unconditional model with only
random effects for persons and items to determine the variability
associated with persons and items. The next model (Model 1)
contained fixed effects for TD ($\gamma_{001}$) and early emerging RD
($\gamma_{002}$) with the late-emerging RD group acting as the referent
group. We chose late-emerging RD as the referent group because
we were interested in comparing them to the TD and early emerging
RD groups. This was done as the next step because we
expected differences across groups. Model 1 was the base model
for all subsequent analyses because we expected group differences
to account for a large amount of variance attributable to the
sampling procedure and how groups were established a priori. This
model also allowed us to determine how well the subsequent
models explained irregular word reading variance. For Model 1
and all subsequent models, we determined the structure of random
effects (i.e., random slopes) using a three-step process. (A random
slope here refers to allowing a child characteristic to vary randomly
across words and allowing a word characteristic to vary
randomly across children.) First, we added all fixed effects together.
Next, we added random slopes for each fixed effect and
completed model comparisons. We compared this model with the
fixed effects model using a likelihood ratio test and the null
hypothesis that the more parsimonious model best fit the data.
Finally, we estimated a model based on the results of the first two
steps. This multistep procedure follows Bates’ (2011) recommendations
and allows us to find a final model that provides the best
fit for the data. The intercept and slope random effects were
assumed to have zero covariance.
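The two simplest models in this sequence can be written out explicitly. The following is a sketch consistent with the verbal description and with the symbols used in the text; the variance symbols and the indicator names $\mathrm{TD}_j$ and $\mathrm{ERD}_j$ are notational assumptions, and the authoritative equations are those given in Appendix B.

```latex
% Model 0 (unconditional): crossed random intercepts for persons (j) and items (i)
\begin{align}
\operatorname{logit}(p_{ji}) &= \gamma_{000} + r_{010j} + r_{020i}, \\
r_{010j} &\sim N\bigl(0, \sigma^2_{r_{010}}\bigr), \qquad
r_{020i} \sim N\bigl(0, \sigma^2_{r_{020}}\bigr).
\end{align}

% Model 1 (reader group): late-emerging RD is the referent group
\begin{equation}
\operatorname{logit}(p_{ji}) = \gamma_{000}
  + \gamma_{001}\,\mathrm{TD}_j
  + \gamma_{002}\,\mathrm{ERD}_j
  + r_{010j} + r_{020i}.
\end{equation}
```

In this form the intercept $\gamma_{000}$ in Model 1 is the log-odds of a correct response for an average late-emerging RD child on an average item.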

We built a series of models to answer our research questions using
Model 1 as a base. Model 1 included only reader group and addressed
the first research question regarding group differences in irregular
word reading skill. Model 2 addressed the second research question,
concerned with child- and word-level covariates. This model included
child reading-related skills ($\gamma_{003}$-$\gamma_{007}$) and word characteristics
($\gamma_{008}$-$\gamma_{013}$). Model 3 was an interaction model including hypothesized
interactions, designed to answer research question three. In this
model, we added interactions for reader group with frequency, imageability,
spelling-to-pronunciation transparency rating, and strange versus
exception words ($\gamma_{014}$-$\gamma_{021}$), as well as interactions for frequency
with spelling-to-pronunciation transparency rating, strange versus exception,
and imageability ($\gamma_{022}$-$\gamma_{024}$). Details about the structure of
the data for these analyses, the equations for the models, and the
covariates included for each model are provided in Appendix B.

We examined the effect of each word level and child level
covariate by calculating the probability of a correct response with
the addition of the covariate to the intercept, following the formula
$p_{ji} = \frac{1}{1 + \exp(-\gamma)}$, with $\gamma$ representing the covariate of interest.
The late-emerging RD group was the referent group and predicted
probabilities are given for an average item and an average child in
the late-emerging RD group where all other covariates are at their
mean values for our sample. We calculated the variability explained
by calculating the reduction in child and item variance
from the base model containing only the TD and early emerging
RD covariates. Two formulas were used, one for the child level
and one for the item level. These formulas were

$$\frac{\sigma^2_{r_{01}(\text{base})} - \sigma^2_{r_{01}(\text{model } n)}}{\sigma^2_{r_{01}(\text{base})}} \quad \text{and} \quad \frac{\sigma^2_{r_{02}(\text{base})} - \sigma^2_{r_{02}(\text{model } n)}}{\sigma^2_{r_{02}(\text{base})}},$$

respectively, where n represents the model to
which the base model was compared (Bryk & Raudenbush, 1992).
For the final models that included random slopes in addition to the
random intercepts, we calculated the variance explained using only
the fixed effects models, a method supported by recent simulations
by LaHuis, Hartman, Hakoyama, and Clark (2014).
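Both calculations in this paragraph, the logit-to-probability conversion and the proportional reduction in variance, are simple enough to sketch directly. The function names are hypothetical, and the variance values in the second call are made-up placeholders rather than estimates from Table 7; the logit of -0.697 is the Model 1 intercept reported in the Results.

```python
import math

def inv_logit(logit: float) -> float:
    """Probability of a correct response implied by a logit value."""
    return 1.0 / (1.0 + math.exp(-logit))

def proportion_variance_explained(var_base: float, var_model: float) -> float:
    """Reduction in a random-effect variance relative to the base model."""
    return (var_base - var_model) / var_base

# Model 1 intercept from the Results section (late-emerging RD referent).
print(round(inv_logit(-0.697), 2))

# Placeholder variances, for illustration only.
print(round(proportion_variance_explained(2.0, 0.5), 2))
```

The first call gives the predicted probability for an average late-emerging RD child on an average item; the second shows how a drop in residual variance translates into a proportion explained.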

Results 

Demographic  data  for  participants  (N  =  170)  are  presented  in 
Table  2.  There  were  more  females  than  males  in  the  sample,  and  10 
children  were  retained  (repeated  a  grade).  The  sample  represents  the 
demographics  of  the  local  district  with  respect  to  the  percent  of 
African  American  (48.82%  sample/47%  district)  and  Caucasian 
(37.06%  sample/35%  district)  children.  The  sample  had  a  lower 
percentage  of  Hispanic  children  compared  to  the  district  (2.94% 
sample/16%  district)  because  of  the  initial  sampling  requirement 
across  the  3  cohorts  that  children  enrolled  in  first  grade  English 
Language  Learner  instruction  be  eliminated  from  the  sample. 

Table 2

Demographic Statistics

| Variable | n | % | Mean (SD) |
|---|---|---|---|
| Age (years) | | | 10.77 (.45) |
| Gender | | | |
| Female | 91 | 53.53 | |
| Male | 79 | 46.47 | |
| Grade | | | |
| 3 | 1 | .59 | |
| 4 | 9 | 5.29 | |
| 5 | 160 | 94.12 | |
| Group | | | |
| ERD | 19 | 11.18 | |
| LERD | 43 | 25.29 | |
| TD | 108 | 63.53 | |
| Race | | | |
| African American | 83 | 48.82 | |
| Asian | 7 | 4.12 | |
| Caucasian | 63 | 37.06 | |
| Hispanic | 5 | 2.94 | |
| Kurdish | 5 | 2.94 | |
| Biracial | 5 | 2.94 | |
| Other | 2 | 1.18 | |

Note. N = 170. ERD = Early-identified reading difficulty; LERD =
Late-emerging reading difficulty; TD = Typically-developing.

Table 3 provides the child-level performance across measures
disaggregated by reader group with associated mean comparisons
(ANOVA using Bonferroni post hoc pairwise comparisons). Except
for book title recognition measures, students in the TD group outperformed
students in both the late-emerging RD and the early emerging
RD groups. Students in the late-emerging RD group outperformed
students in the early emerging RD group on orthographic processing,
phonological awareness, rapid automatized naming, and overall irregular
word reading. No significant differences were found between
early emerging RD and late-emerging RD groups on measures of
vocabulary and book title recognition. The table also provides disaggregated
performance on standardized measures for word identification,
passage comprehension, and nonword decoding, with a consistent
ordering effect of TD > late-emerging RD > early emerging RD.
Table 4 provides the descriptive statistics for the word characteristics
across all 49 words in the analyses. The average word frequency using
the SFI metric was 50.51, the average spelling-to-pronunciation transparency
rating across all 49 words was 3.59 on the 6-point scale, the
average imageability rating was 5.82 (7-point scale), the average word
length was 5.22 letters, and the average orthographic neighborhood
size was 2.51.

Table 4

Word Level Descriptive Statistics (N = 49)

| Variable | M (SD) | Min | Max |
|---|---|---|---|
| Frequency (SFI) | 50.51 (6.84) | 33 | 61.60 |
| Spelling-to-pronunciation transparency rating | 3.59 (.78) | 1.95 | 5.27 |
| Imageability | 5.82 (1.34) | 1.67 | 6.88 |
| Length | 5.22 (1.04) | 4 | 8 |
| Orthographic neighborhood size | 2.51 (3.10) | 0 | 14 |

Table  5  provides  the  zero-order  correlations  among  the  child 
level  predictors  of  irregular  word  reading.  There  were  significant 
correlations  between  all  child  level  predictors  of  word  reading  with 
the  exception  of  the  Book  Title  Questionnaire  variable.  The  Book 
Title  Questionnaire  was  not  significantly  correlated  with  any  of  the 
other  child  level  predictors,  including  average  performance  on  the 
irregular word reading task. All of the remaining child-level predictors
were highly correlated with irregular word reading, with
absolute correlations ranging from .47 (rapid automatized naming) to .85
(nonword decoding). Table 6 provides zero order correlations for
all word level predictors. In this table, we included the aggregated
irregular word reading accuracy for words across children. Word-level
characteristics most highly correlated with irregular word
reading difficulty were frequency (r = .71) and the spelling-to-pronunciation
transparency rating (r = -.50). There were also
significant correlations among the word level predictors. The
spelling-to-pronunciation transparency rating was significantly
correlated with frequency (r = -.45), with more difficult words
being less frequent. Similarly, words that were strange were less
frequent (r = -.38) and had higher spelling-to-pronunciation
transparency ratings (r = .38). Length and imageability were
significantly correlated (r = .35), with longer words typically
being more imageable. Orthographic neighborhood size was significantly
correlated with strange (r = -.40) and length
(r = -.42), meaning that strange words and longer words
had smaller orthographic neighborhood sizes.


Table 3

Child Level Descriptive Statistics (N = 170)

| Variable | TD (n = 108) M (SD) | ERD (n = 19) M (SD) | LERD (n = 43) M (SD) | All children (N = 170) M (SD) | Pairwise comparisonsᵃ |
|---|---|---|---|---|---|
| PA | 13.84 (4.49) | 6.58 (3.92) | 9.19 (3.56) | 11.85 (5.00) | TD > LERD > ERD |
| OC | 36.20 (2.34) | 26.79 (4.42) | 33.40 (3.14) | 34.44 (4.11) | TD > LERD > ERD |
| RAN | 35.30 (7.97) | 48.68 (15.82) | 40.79 (9.31) | 38.18 (10.39) | TD > LERD > ERD |
| VOC | 158.48 (17.36) | 130.21 (21.69) | 139.07 (19.46) | 150.41 (21.34) | TD > LERD = ERD |
| BTQ | 12.06 (6.97) | 16.47 (10.22) | 11.84 (8.38) | 12.49 (7.83) | TD = LERD = ERD |
| IRWR | 34.78 (5.95) | 9.42 (9.74) | 21.09 (5.19) | 28.48 (10.92) | TD > LERD > ERD |
| WJ-WIDᵇ | 101.11 (7.56) | 75.89 (12.50) | 87.02 (3.43) | 94.73 (11.69) | TD > LERD > ERD |
| WJ-WAᵇ | 106.10 (8.23) | 78.95 (15.94) | 90.47 (5.68) | 99.11 (13.19) | TD > LERD > ERD |
| WJ-PCᵇ | 100.18 (7.26) | 70.63 (13.54) | 87.16 (8.23) | 93.58 (12.92) | TD > LERD > ERD |

Note. OC = Orthographic choice; PA = Phonological Awareness; RAN = Rapid Automatized Naming; VOC = Vocabulary (PPVT); BTQ = Book title
questionnaire; IRWR = Irregular Word reading; WJ = Woodcock Johnson; WID = Word Identification; PC = Passage Comprehension; WA = Word
Attack.
ᵃ Mean comparisons were conducted using ANOVA with Bonferroni post hoc pairwise comparisons. ᵇ Standard scores (M = 100, SD = 15).

Table 5

Zero Order Correlations Between Child Variables

| Variable | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|
| 1. Irregular word reading | – | | | | | | |
| 2. PA | .60 | – | | | | | |
| 3. RAN | -.47 | -.39 | – | | | | |
| 4. OC | .79 | .40 | -.43 | – | | | |
| 5. VOC | .65 | .46 | -.27 | .37 | – | | |
| 6. BTQ | -.11 | -.13 | -.04 | -.06 | -.07 | – | |
| 7. NWD | .85 | .59 | .51 | .71 | .49 | -.09 | – |

Note. p < .001 for all variables in bold. PA = Phonological Awareness;
RAN = Rapid Automatized Naming; OC = Orthographic choice; VOC =
Vocabulary (PPVT); BTQ = Book Title Questionnaire; NWD = Nonword
Decoding.


Child-  and  Word-Level  Characteristics  Related  to 
Irregular  Word  Recognition 

The  unconditional  model  (Model  0)  containing  only  person  and 
item random effects had an intercept logit estimate of $\gamma_{000} = 0.64$,
corresponding to a predicted probability of a correct irregular word
reading response of .66 for the average child and the average item
(see Table 7). There was variability around that average for both
children ($\sigma^2_{r_{010j}} = 4.58$) and items ($\sigma^2_{r_{020i}} = 5.87$).

To  address  whether  differences  exist  in  irregular  word  reading 
between  classes  of  children  identified  as  late-emerging  RD  and  those 
identified  as  early  emerging  RD  and  TD  we  added  the  two  dummy 
variables  representing  the  late-emerging  RD-TD  comparison  (TD) 
and  the  late-emerging  RD-early  emerging  RD  comparison  (ERD). 
Because  we  expected  there  to  be  differences  between  reader  groups, 
this  model  was  completed  first  and  acted  as  the  base  model  to  which 
all  subsequent  models  were  compared.  Results  of  the  reader  group 
model (Model 1) are displayed in Table 7. The model with TD ($\gamma_{001}$)
and ERD ($\gamma_{002}$) fixed effects improved model fit over the unconditional
model as expected, $\Delta\chi^2 = 171.41$, p < .0001. In an effort to
construct the best fitting model, an iterative process was used in which
random slopes for TD ($r_{001i}$) and ERD ($r_{002i}$) were then added to the
model. This model fit the data significantly better than the previous
model, $\Delta\chi^2 = 19.44$, p < .0001. Decisions about the inclusion of
these random slopes were based on the mixed chi-square distribution
LRT (Stram & Lee, 1994). The variance associated with TD and ERD
indicated that only TD required a random slope. A comparison of the
two models was conducted and the model with a random slope for
ERD did not significantly improve model fit ($\Delta\chi^2 = .20$, p = .65) and
therefore the most parsimonious model was used. This model was
used as the base model for all subsequent model comparisons. The
Model 1 person random effect variance was 1.46 and was the variance
against which the subsequent models were compared. The intercept
($\gamma_{000} = -0.697$) for Model 1 indicated a mean probability of a
correct  response  of  .33  for  the  average  LERD  child,  .04  for  the 
average  ERD  child,  and  .86  for  the  average  TD  child.  Fixed  effects 
suggest  that  when  only  reader  group  is  entered  into  the  model  the 
probability  of  correctly  reading  an  irregular  word  was  greater  for  TD 
compared  with  LERD  as  well  as  greater  for  LERD  compared  with 
ERD. 
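The likelihood ratio comparisons above can be illustrated with a small calculation. The sketch assumes a test with 1 degree of freedom (one added variance parameter, given that intercept and slope random effects were assumed uncorrelated); the degrees of freedom for each reported test are not recoverable from the text, so this is an assumption. The `mixture` option applies the 50:50 mixture of chi-square distributions with 0 and 1 degrees of freedom that Stram and Lee (1994) describe for a single variance component. Function names are hypothetical.

```python
import math

def chi2_sf_df1(x: float) -> float:
    """Upper-tail probability of a chi-square variable with 1 df.
    Uses P(chi2_1 > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2))."""
    return math.erfc(math.sqrt(x / 2.0))

def lrt_pvalue_df1(delta_chi2: float, mixture: bool = False) -> float:
    """p value for a 1-df likelihood ratio test; with mixture=True the
    p value is halved (50:50 mixture of chi2_0 and chi2_1)."""
    p = chi2_sf_df1(delta_chi2)
    return 0.5 * p if mixture else p

# Change in deviance reported for adding a random slope for ERD.
print(round(lrt_pvalue_df1(0.20), 2))
print(round(lrt_pvalue_df1(0.20, mixture=True), 2))
```

With a deviance change of .20, the unadjusted 1-df p value rounds to the .65 reported in the text; the mixture-adjusted value is half of that, and either way the random slope for ERD would not be retained.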

Our  primary  research  question  concerned  the  degree  to  which  child 
reading-related skills and word characteristics covariates predicted
irregular word reading, controlling for reader group. To address this
question, we used a child and word covariate model (Model 2) with
results provided in Table 7. When we constructed Model 2, main
effects for all child and word level covariates were first added to
Model 1 simultaneously. This model fit the data significantly better
than the reader group model, $\Delta\chi^2 = 220.72$, p < .0001. Using an
iterative model building process, a random slope for each of the
covariates was then added to the model. The best fitting model
contained a random item slope for TD and a random item slope for
nonword decoding. Adding a random slope across words for nonword
decoding significantly improved model fit over the model with only
main effects, $\Delta\chi^2 = 14.31$, p < .001. The intercept ($\gamma_{000} = .081$)
indicated a mean probability of a correct response of .55 for the
average LERD child, all other things being equal. Model 2 explained
74%  of  the  person  variance  present  in  Model  1  and  60%  of  the  item 
variance.  Note  that  the  main  effects  reported  here  are  conditional  on 
the  other  covariates  in  the  model. 

For child characteristics, we found significant effects for three
covariates (orthographic choice, vocabulary, and nonword decoding).
Below, we list probabilities for each covariate for an average
late-emerging RD child. For orthographic choice ($\gamma_{003} = 0.186$),
an orthographic choice score 1 SD above the sample mean corresponds
to a .73 probability of a correct response, whereas an
orthographic choice score 1 SD below the mean would correspond
to a .37 probability of a correct response. For vocabulary ($\gamma_{006} = 0.027$),
a PPVT score 1 SD above the mean corresponded to a .67
probability of a correct response, compared with .41 for 1 SD
below the mean. Finally, for nonword decoding ($\gamma_{008} = 0.073$), a
score 1 SD above the mean increased the probability of a correct
response to .71 whereas a score 1 SD below the mean decreased
the probability to .38.

For word characteristics, we found a significant effect of frequency,
γ009 = 0.268, indicating that the probability of a correct response for
an average late-emerging RD child on a word with frequency 1 SD
above the sample mean was .88, whereas for a word with frequency 1 SD
below the mean it was .17. For the spelling-to-pronunciation
transparency rating, the effect was marginally significant in the main effect
model (z = 1.92) and significant in the interaction model. Therefore,
we interpret that effect here. The probability of a correct response for
an average late-emerging RD child on a word with a rating 1 SD
above the sample mean (meaning rated as more difficult) was .42,
whereas for a word with a rating 1 SD below the sample mean it was .67.


Table 6

Zero Order Correlations Between Word Variables

Variable                                           1      2      3      4      5      6      7
1. Irregular word reading                          —
2. Frequency                                       .71    —
3. Spelling-to-pronunciation transparency rating   -.50   -.45   —
4. Strange vs. exception                           -.24   -.38   .38    —
5. Imageability                                    -.11   -.21   .09    .02    —
6. Length                                          -.12   -.27   .21    .02    .35    —
7. Orthographic neighborhood size                  .16    .24    -.19   -.40   -.23   -.42   —

Note. p < .001 for all variables in bold.




Table 7

Fixed Effects and Variance Estimates for Irregular Word Reading

                                          Model 0                    Model 1                    Model 2                    Model 3
                                          Unconditional model        Group model                Person and item model      Interaction model
Fixed effects parameter                   Est.     (SE)     z        Est.     (SE)     z        Est.     (SE)     z        Est.     (SE)     z
Intercept (γ000)                          .638     (.385)   1.658    -.697    (.411)   -1.694   .206     (.359)   .209     .128     (.375)   .342
Group
  γ001 ERD                                —        —        —        -2.570   (.375)   -6.845   -.566    (.266)   2.125    -.340    (.316)   1.075
  γ002 TD                                 —        —        —        2.540    (.257)   9.893    .453     (.199)   2.274    .405     (.219)   1.849
Child covariates
  γ003 PA                                 —        —        —        —        —        —        .028     (.015)   1.820    .028     (.015)   1.826
  γ004 OC                                 —        —        —        —        —        —        .186     (.023)   7.740    .186     (.024)   7.765
  γ005 RAN                                —        —        —        —        —        —        -.005    (.007)   .756     -.005    (.007)   .764
  γ006 Vocabulary                         —        —        —        —        —        —        .027     (.003)   7.849    .027     (.003)   7.841
  γ007 BTQ                                —        —        —        —        —        —        .001     (.008)   .013     .001     (.008)   .002
  γ008 Nonword decoding                   —        —        —        —        —        —        .073     (.013)   5.770    .073     (.013)   5.749
Word covariates
  γ009 Frequency                          —        —        —        —        —        —        .268     (.042)   6.454    .335     (.054)   6.240
  γ010 Spelling to pronunciation rating   —        —        —        —        —        —        -.686    (.357)   1.920    -.772    (.380)   2.028
  γ011 Strange vs. exception              —        —        —        —        —        —        .417     (.585)   .714     .391     (.578)   .676
  γ012 Imageability                       —        —        —        —        —        —        .129     (.189)   .683     .034     (.189)   .179
  γ013 Length                             —        —        —        —        —        —        .241     (.270)   .893     .195     (.263)   .743
  γ014 ON                                 —        —        —        —        —        —        .011     (.092)   .116     .002     (.094)   .020
Interactions
  γ015 ERD × Frequency                    —        —        —        —        —        —        —        —        —        -.019    (.030)   .621
  γ016 TD × Frequency                     —        —        —        —        —        —        —        —        —        -.027    (.022)   1.211
  γ017 ERD × Imageability                 —        —        —        —        —        —        —        —        —        .204     (.116)   1.752
  γ018 TD × Imageability                  —        —        —        —        —        —        —        —        —        .201     (.084)   2.382
  γ019 ERD × Rating                       —        —        —        —        —        —        —        —        —        -.158    (.244)   .649
  γ020 TD × Rating                        —        —        —        —        —        —        —        —        —        .081     (.165)   .489
  γ021 ERD × Strange                      —        —        —        —        —        —        —        —        —        -.645    (.327)   1.970
  γ022 TD × Strange                       —        —        —        —        —        —        —        —        —        .150     (.228)   .658
  γ023 Freq. × Rating                     —        —        —        —        —        —        —        —        —        .060     (.057)   1.057
  γ024 Freq. × Strange                    —        —        —        —        —        —        —        —        —        -.154    (.085)   1.804
  γ025 Freq. × Imageability               —        —        —        —        —        —        —        —        —        .006     (.026)   .233

Random effects                            Variance % var explained   Variance % var explained   Variance % var explained   Variance % var explained
Intercepts
  Person                                  4.578    —                 1.517    —                 .392     74%               .391     74%
  Item                                    5.870    —                 6.292    —                 2.498    60%               2.295    63%
Person slopes                             N/A      —                 —        —                 —        —                 —        —
Item slopes
  TD                                      —        —                 .358     —                 .135     —                 .032     —
  Nonword decoding                        —        —                 —        —                 .001     —                 .001     —
Deviance                                  6185                       5994                       5759                       5736

Note. ERD = Early reading difficulty; TD = Typical development; PA = Phonological awareness; OC = Orthographic choice; RAN = Rapid automatic
naming; BTQ = Book Title Questionnaire; ON = Orthographic neighborhood size; Freq. = Frequency. p < .05 for variables in bold.


Irregular  Word  Recognition  in  Students  With 
Late-Emerging  RD 

The third research question concerned interactions between child-
and word-level predictors in explaining irregular word reading item
variance. This exploratory model (Model 3) estimated 11 interaction
terms to answer this question. This model explained 74% of the
person level variance and 63% of the item level variance. There were
two significant child by word interactions, TD status by imageability
and ERD status by strange versus exception word class. The first
significant interaction, TD status by imageability (γ018 = 0.201),
indicates that students in the TD group significantly outperformed
students in the late-emerging RD group on words that were highly
imageable but not on words that were low on the Imageability Scale
(see Figure 1). The second significant interaction, ERD status by
strange word class (γ021 = -.645), indicated that students in the
late-emerging RD group had a significantly higher probability of
reading a word correctly if it was classified as strange than did students
in the early emerging RD group. This pattern was not true for words
that were classified as exception words (see Figure 2).

Figure 2. Interaction between reader group and word categorization as
strange or exception. Legend: ERD, LERD.

Discussion

In this study, we argue that irregular words present unique
challenges for developing readers (Waters et al., 1984; Taylor et al., 2011;
Wang et al., 2013). Under a self-teaching mechanism that builds the
orthographic lexicon through item-specific learning, simply applying
GPC rules to irregular words will not result in the correct
pronunciation (a phenomenon referred to as partial decoding), and this likely
disrupts or slows the addition of items to the orthographic lexicon.
This has led many to suggest that the addition of irregular words to the
orthographic lexicon in English is modulated by child- and word-level
factors beyond phonologically based recoding skill and print
experience (Nation & Snowling, 1998; Plaut et al., 1996; Wang et al., 2013;
Waters et al., 1985; Waters et al., 1984). We maintain that
comprehensive models of irregular word reading are needed to assess which
child-level variables, word-level variables, and child by word
interactions are associated with item-level variance. Furthermore, we
assert that models such as the one developed in this study are
necessary to begin the search for potentially malleable factors that can
improve the ability of children, with particular attention to those with
RD, to recognize irregular words. Overall, results suggest that there
are multiple sources that explain irregular word reading variance
including child-level characteristics, word-level characteristics, and
interactions between child- and word-level characteristics. The
following sections discuss results within the context of our three research
questions with a focus on a model of irregular word reading in which
lexical characteristics specific to both child and word influence
accuracy.

Child-  and  Word-Level  Characteristics  Related  to 
Irregular  Word  Reading 

We found several significant predictors at the child level. The
first predictor was reader group, with a significant difference in
irregular word reading identified between TD students and
those with late-emerging RD and between late-emerging RD
and early emerging RD students (Model 1). These group
differences remained after child- and word-level predictors were
entered into the model (Model 2), but were no longer significant
once the interaction terms were added to the child- and
word-level predictors. Overall, results indicate that the general word
reading difficulties observed in the late-emerging RD and early
emerging RD groups also impact their irregular word reading
skill, but once controlling for important child-level factors (e.g.,
decoding, orthographic knowledge, vocabulary) along with
child by word interactions, differences between all three groups
on irregular word reading skill were minimized. Thus,
differences between early emerging RD and late-emerging RD
groups on irregular word reading appear explainable by the
lower performance of the early emerging RD group on
important cognitive skills associated with reading performance.

Cognitive variables that were significantly related to irregular
word reading within the entire sample of children included
nonword decoding, vocabulary, and orthographic knowledge.
Given the overwhelming evidence for the correlation between
decoding skills and word reading, a relationship with irregular
word reading was not unexpected. The significant role of
decoding on irregular word reading is consistent with speculation
by Griffiths and Snowling (2002) that the ability to read
irregular words is at least partially dependent on having access to
“segmental phonological representations” (p. 41). A second
significant predictor was orthographic knowledge measured
using the orthographic choice task. Students with greater
orthographic knowledge had a higher probability of reading
irregular words correctly. This finding is consistent with other
studies showing that orthographic knowledge is related to
general word reading skill (e.g., Cunningham, Perry, &
Stanovich, 2001). There was a similar pattern for vocabulary; students
with a higher receptive vocabulary score had a higher
probability of successfully recognizing irregular words, even when
controlling for reader group, decoding, OC, and RAN. This
pattern was consistent across Models 2 and 3, and supports our
hypothesis, and a developmental word-reading model, in which
orthographic-to-phonological pathways necessary to establish
irregular word representations may be at least partially
dependent on lexical input (Nation & Snowling, 1998; Plaut et al.,
1996; Ricketts et al., 2007; Tunmer & Chapman, 2012). It is
noteworthy that we did not find reading experience, as
measured by the BTQ, to be a significant predictor of irregular word
reading. The nonsignificant correlation between irregular word
reading and the BTQ suggests that it was not a sensitive
measure. We interpret these results as a measurement issue with
the BTQ and do not suggest that print exposure is unimportant
for the acquisition of irregular words (for a discussion of the
measurement issues related to the BTQ and print exposure see
Foorman, 1994). Instead, we speculate that the titles
contained on the list did not capture important individual
differences in print exposure.

At the word level, we found the expected main effect for frequency.
We also found that our expert rating of spelling-to-pronunciation
difficulty was marginally significant in the main effect model and was
significant in the interaction model. This finding is consistent with our
hypothesis that the ease of phonological recoding from an
orthographic stimulus is a good predictor of success in reading an irregular
word correctly. This pattern seems to suggest that the more closely the
phonological output from decoding represents the actual lexical
phonological representation, the higher the probability of a correct
response (Elbro et al., 2012; Tunmer & Chapman, 2012; Venezky,
1999). Contrary to suggestions by Seidenberg et al. (1984), we did not
find a significant main effect for the irregular word type (i.e.,
exception vs. strange) after controlling for other word-level features,
suggesting that these two classes of words may be processed similarly for
the entire sample; however, differences between late-emerging RD
and early emerging RD groups were present. It may also be that the
rating of spelling-to-pronunciation is a more precise way to
characterize the degree of irregularity than dichotomous coding of exception
or strange.
Irregular Word Reading in Students With
Late-Emerging  RD 

The value of distinguishing between students who are TD and those
with early emerging RD and late-emerging RD was also of interest in
this study. The use of item-based random effects models allowed us
to probe for interactions between reader group and word-level
characteristics. We were particularly interested in whether children with
late-emerging RD differed in how they process irregular words
compared with TD children and children with early emerging RD. One of
the prevailing hypotheses in the field is that the reading difficulties
associated with late-emerging RD may be attributable to the
increasing demands placed on the reader at the word level, as students are
faced with new words many of which are lower in frequency and
contain irregular spelling patterns, and thus require the formation of
complex connections between phonology, orthography, and
semantics (Catts et al.; Leach et al., 2003).

Two interactions emerged between word-level characteristics and
reader group that tend to support the view that lexical feedback may
be somewhat limited during irregular word learning in children with
late-emerging RD. The first involved an interaction between the TD
and late-emerging RD groups and imageability in which increases in
word imageability were associated with greater irregular word reading
skill in TD, but not late-emerging RD children. This relationship is in
the opposite direction from that typically reported in the literature, in
which less skilled readers benefit more from words that are higher in
imageability (Coltheart, Laxon, & Keating, 1988; Steacy et al., 2013).
However, it should be noted that the interaction between early
emerging RD and late-emerging RD on imageability was approaching
significance. Results suggest that after controlling for everything else
in the model, children with late-emerging RD may have specific
difficulties exploiting words with higher imageability. Without
knowing whether children with late-emerging RD are familiar with each of
the words, we can only speculate that the students in the late-emerging
RD group may not activate and use lexical properties such as
imageability in the same way as the TD students and that this makes it
unlikely that they will retrieve images associated with the word to aid
in the formation of an orthographic representation. Furthermore, we
interpret our results to suggest that having a word in the semantic
lexicon that is highly imageable may increase the probability that a
student will accurately identify an irregular word for students who are
typically developing and students in the early emerging RD group.
Future studies would benefit from assessing item-level semantics,
lexical phonology, and imageability to allow the teasing apart of these
influences on irregular word reading in children with late-emerging RD.
It would also be quite helpful to use the set for variability paradigm
as an item-level variable to assess whether individual differences in
children’s ability to bridge the gap between the regularized decoding
pronunciation and the actual phonological representation are an
important predictor of item level variance.

The second interaction involved early emerging RD and
late-emerging RD groups and word-level orthographic complexity (i.e.,
exception vs. strange words). There was little difference between the
groups on exception word reading but significant differences on
strange word reading favoring the late-emerging RD group. Results
do not seem to point to specific problems in the formation of complex
connections between phonology and orthography in the late-emerging
RD group as the interaction between late-emerging RD and TD
groups and orthographic complexity was not significant. Instead,
results seem to favor a model in which the early emerging RD group
has greater problems forming the complex connections between
phonology and orthography that are contained in strange words.
Seidenberg et al. (1984) have offered that strange words are processed
differently than exception words. It is the case that the early emerging
RD group has greater deficits in phonological processing compared to
the late-emerging RD group and this may make it more difficult for
early emerging RD children to form precise representations between
the orthography and phonology represented in strange words. Overall,
the interactions seem to suggest that lexical differences (e.g., semantic
and lexical phonology) better explain differences between
late-emerging RD and TD children as compared to more traditional
phonological-orthographic processing skills. However, we regard
these interaction effects to be exploratory in nature because of the
relatively small sample size, particularly the early emerging RD
group. Differences in results between our study and those reported by
Griffiths and Snowling (2002) are likely attributable to the inclusion
of both RD and TD children, the use of both child- and word-level
characteristics as predictors, and the added power garnered by
item-level analyses.

Limitations 

There are several limitations that should be considered in
relation to this study. The first is the sample sizes of both the students
and the words. Although we note that we should have adequate
power to detect an effect across the 49 words and 170 children,
readers should exercise caution when generalizing these results to
other students and words. Furthermore, an item-specific measure
of familiarity would be useful for these analyses to control for
whether students have phonological representations of these words
in their lexicons. Finally, a parallel measure of regular words
sampled in the same way as the Adams and Huggins (1985)
irregular word list used in this study would also be informative and
provide a contrast between regular and irregular words.

Conclusion 

In conclusion, this study has identified multiple sources affecting
individual differences in irregular word reading among 5th grade
children oversampled for RD. Results indicate that variance in
irregular word reading was predicted at the child level by decoding skill,
orthographic coding, and vocabulary; at the word level by word
frequency and a spelling-to-pronunciation transparency rating; and by
the Reader group × Imageability and Reader group × Orthographic
complexity interactions. For irregular words, partial decoding likely
requires having item-specific knowledge (semantic and/or
lexical phonology) to help to fill voids resulting from the mismatch
between orthographic-phonology mapping and lexical phonology in
emergent readers (Keenan & Betjemann, 2008). Our results support a
developmental model of word reading in which
orthographic-to-phonological pathways become at least partially dependent on lexical
input (input of semantic knowledge), with this influence being
increasingly important for irregular words (Nation & Snowling, 1998;
Ricketts et al., 2007; Tunmer & Chapman, 2012). Overall, results
from our study support the role of lexical influence on irregular word
reading with vocabulary skill and spelling-to-pronunciation ease
having a main effect and word imageability acting as a moderator. Thus,
we conclude that lexical representations appear to be important in
irregular word reading and advocate for further work examining the
role of lexical processing on irregular word reading in children with
RD. Although the design of this study does not allow for causal
inferences, allowing word- and child-attributes to compete for
variance in the same model provides an opportunity to consider new, and
possibly untested, approaches to effectively teach irregular word
reading skills to struggling readers (Compton, Miller, Elleman, &
Steacy, 2014). Specifically, we encourage future studies examining
whether item-level training of semantics and lexical phonology can
increase the speed at which children with RD add irregular words to
their orthographic lexicons.
Appendix A

Sampling  Plan  for  Each  of  the  Cohorts 


At the end of fourth grade, single indicator hidden Markov
models were fit separately for the four time points representing
word reading and reading comprehension development. These
models are considered a first-order Markov process where the
transition matrices are specified to be equal over time (i.e.,
measurement invariance across time; Langeheine & van de Pol, 2002).
Hidden Markov models are a form of latent class analysis, known
as latent transition analysis (LTA), where class indicators
(categorical variables indicating RD and TD groups) are measured over
time and individuals are allowed to transition between latent
classes. LTA addresses questions concerning prevalence of
discrete states and incidence of transitions between states and
produces parameter estimates corresponding to the proportion of
individuals in each latent class initially, as well as the probability
of individuals changing classes with time. LTA models were
generated using mixture modeling routines contained in Mplus 5.0
(Muthén & Muthén, 1998-2012). Model estimation was carried
out using a maximum likelihood estimator with robust standard
errors. Detailed discussions of LTA can be found in Collins and
Wugalter (1992) and Reboussin, Reboussin, Liang, and Anthony
(1998). A cut point of the 25th percentile (based on a national
norming sample) was used at each time point to represent reading
difficulty in word reading and reading comprehension.1 Model fit
was estimated with the likelihood ratio chi-square. The likelihood
ratio compares the observed response proportions with the
response proportions predicted by the model (Kaplan, 2008). As
with most SEM-based models, the null hypothesis for chi-square
model tests is that the specified model holds for the given
population, and therefore accepting the null hypothesis implies that the
model is plausible. All models across cohorts were found to fit the
data adequately.
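
To make the first-order Markov assumption concrete, the sketch below propagates latent class proportions across four time points using a single transition matrix held equal over time. The initial proportions and transition probabilities are hypothetical values chosen for illustration, not estimates from this study, and the sketch covers only the latent transition part of the model, not the measurement part or the Mplus estimation.

```python
from typing import List

def step(state: List[float], transition: List[List[float]]) -> List[float]:
    """Advance class proportions one time point: new[k] = sum_j state[j] * P[j][k]."""
    n = len(state)
    return [sum(state[j] * transition[j][k] for j in range(n)) for k in range(n)]

# Hypothetical values for illustration only (not estimates from the study).
# Classes: index 0 = RD (below the 25th percentile), index 1 = TD.
initial = [0.30, 0.70]
transition = [
    [0.85, 0.15],  # P(RD -> RD), P(RD -> TD)
    [0.10, 0.90],  # P(TD -> RD), P(TD -> TD)
]

# The same transition matrix is applied at every step (equal over time).
proportions = [initial]
for _ in range(3):  # three transitions give four time points in total
    proportions.append(step(proportions[-1], transition))

for t, (rd, td) in enumerate(proportions, start=1):
    print(f"time {t}: RD = {rd:.3f}, TD = {td:.3f}")
```

In this toy setup, children who move from the TD class to the RD class after the first time point would correspond to the late-emerging pattern discussed in the text.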


Cohort  1  (Compton  et  al.,  2010) 

Participants in the first cohort were selected from 56
first-grade classrooms in 14 schools within an urban district located
in middle Tennessee. Seven study schools were Title I institutions.
We assessed every formally consented child (n = 712) with three
1-min study identification measures: WIF screen, rapid letter
naming (RLN), and rapid letter sound (RLS). With WIF screen,
children are presented with a single page of 50 high-frequency words
randomly sampled from 100 high-frequency words from the Dolch
preprimer, primer, and first-grade level lists (Fuchs, Fuchs, &
Compton, 2004). They have 1 min to read words. With RLN and
RLS, the speed at which children name an array of the 26 letters
and the sounds of the letters is measured. For all three measures,
scores were prorated if a child named all items in less than 1 min.
We used these data to divide the 712 children into high-, average-,
and low-performing groups with the use of latent class analysis
and then randomly selected study children from each group. We
oversampled low-performing children to increase the number of
struggling readers in the prediction models. Four hundred and
eighty-five children were included: 310 low study entry (LSE), 83
average study entry (ASE), and 92 high study entry (HSE).
Follow-up testing was performed at the end of first through fourth
grade. At follow-up in the spring of fourth grade, 200 of the
original 485 children (41% of the sample) had moved from the
district and were unavailable for assessment.


1  Any cut point of a dimensional variable to designate reading difficulty
is arbitrary (Fletcher et al., 1999); however, the 25th percentile has
consistently been employed in the literature.
Cohorts 2 and 3 (Gilbert et al., 2013)

The sampling procedures for Cohorts 2 and 3 were identical
and are therefore combined here. Initially we asked first-grade
teachers to identify the lowest half of their class in terms of
reading skill. Children in Cohort 2 were drawn from 9 schools
(5 Title I) in 37 first-grade classrooms and children in Cohort 3
from 9 schools (5 Title I) and 32 first-grade classrooms within
an urban district located in middle Tennessee. We screened 628
of the identified students with three 1-minute measures: two
WIF lists and a RLN. Scores were prorated if a student named
all items in less than one minute. To identify an initial pool of
students potentially at elevated risk for poor reading outcomes
we applied latent class analysis (Nylund, Asparouhov, &
Muthén, 2007) to the three screening measures. The purpose of
such an analysis was to obtain model-based latent (unobserved)
categories of students who are performing similarly on the three
screening measures. Models were developed and evaluated
using Mplus version 6 (Muthén & Muthén, 1998-2010). A
clear category of at-risk students was identified for Cohort 2
having 223 and Cohort 3 having 214 at-risk students. Students
not populating the at-risk category were excluded from further
follow-up. A portion of the at-risk first grade children were
randomly assigned to 14 weeks of small group tutoring or a
business as usual control group. Follow-up testing was
performed at the end of first through fourth grade. At follow-up in
the spring of fourth grade, 66 of the original 223 children (30%
of the sample) in Cohort 2 and 50 of the original 214 children
(23% of the sample) in Cohort 3 had moved from the district
and were unavailable for assessment. A chi-square test was
performed to examine the relation between first-grade tutoring
(treatment and control) and fourth grade LERD status (ERD,
LERD, and TD). Results indicate no relationship between first
grade treatment and reading class assignment in fourth grade,
χ²(2, N = 172) = .2828, p = .868.
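
As a quick check on the reported test, the upper-tail probability of a chi-square statistic with 2 degrees of freedom has the closed form exp(-x/2), so the p value can be recomputed without a statistics library. The function name below is illustrative.

```python
import math

def chi2_sf_df2(statistic: float) -> float:
    """Upper-tail probability of a chi-square variate with 2 degrees of freedom."""
    return math.exp(-statistic / 2.0)

# Test statistic reported for first-grade tutoring by fourth-grade reader group.
p_value = chi2_sf_df2(0.2828)
print(f"p = {p_value:.3f}")
```

The result rounds to .868, matching the reported value.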


Appendix  B 

Data  Structure  and  Equations 


Figure B1. Data structure for crossed-random effects models (Gilbert et al., 2011).
Diagram labels: Person, Word.
Table B1

Crossed-Random Effects Model Equations

Base models

Unconditional (Model 0)
Level 1 (Responses$_{jik}$): $\operatorname{logit}(\pi_{jik}) = \lambda_{0jk}$
Level 2 (Person$_j$ & Word$_i$): $\lambda_{0jk} = \gamma_{00} + r_{01j} + r_{02i}$, with $r_{01j} \sim N(0, \sigma^2_{r01})$ and $r_{02i} \sim N(0, \sigma^2_{r02})$

Base model (Model 1)
Level 1 (Responses$_{jik}$): $\operatorname{logit}(\pi_{jik}) = \lambda_{0jk}$
Level 2 (Person$_j$ & Word$_i$): $\lambda_{0jk} = \gamma_{00} + \mathrm{ERD} + \mathrm{TD} + r_{01j} + r_{02i}$, with $r_{01j} \sim N(0, \sigma^2_{r01})$ and $r_{02i} \sim N(0, \sigma^2_{r02})$

Main effects model

Child- and word-level predictors (Model 2)
Level 1 (Responses$_{jik}$): $\operatorname{logit}(\pi_{jik}) = \lambda_{0jk}$
Level 2 (Person$_j$ & Word$_i$): $\lambda_{0jk} = \gamma_{00} + \mathrm{ERD} + \mathrm{TD} + \sum_{a=1}^{A} \gamma_{a} M_{ai} + \sum_{b=1}^{B} \gamma_{b} N_{bj} + r_{01j} + r_{02i}$, with $r_{01j} \sim N(0, \sigma^2_{r01})$ and $r_{02i} \sim N(0, \sigma^2_{r02})$

Interaction model

Word by reader group interactions (Model 3)
Level 1 (Responses$_{jik}$): $\operatorname{logit}(\pi_{jik}) = \lambda_{0jk} + \sum_{c=1}^{C} \gamma_{c}\,(\mathrm{LERD} \times M_{ai}) + \sum_{d=1}^{D} \gamma_{d}\,(\mathrm{ERD} \times M_{ai})$
Level 2 (Person$_j$ & Word$_i$): $\lambda_{0jk} = \gamma_{00} + \mathrm{ERD} + \mathrm{TD} + \sum_{a=1}^{A} \gamma_{a} M_{ai} + \sum_{b=1}^{B} \gamma_{b} N_{bj} + r_{01j} + r_{02i}$, with $r_{01j} \sim N(0, \sigma^2_{r01})$ and $r_{02i} \sim N(0, \sigma^2_{r02})$

Note. $\pi_{ji}$ = probability of a correct response from person j on word i; k = item; $\gamma_{00}$ = intercept; $M_a$ = item covariate; $N_b$ = person covariate. Main effects
models are shown with random intercepts only for simplicity, but random slopes were included in some models, as described in the text.
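
To make the crossed structure concrete, the unconditional model (Model 0) can be simulated: every person meets every word, and each response depends on an intercept plus one random effect for the person and one for the word. The sketch below is illustrative only. The parameter values and the function name are assumptions chosen for demonstration, not estimates or code from the study, and it drops the k index for simplicity.

```python
import math
import random


def simulate_responses(n_persons, n_words, gamma00, sd_person, sd_word, seed=0):
    """Simulate binary responses from a crossed random-intercept logit model.

    logit(pi_ji) = gamma00 + r_01j + r_02i, where r_01j (person j) and
    r_02i (word i) are independent, normally distributed random intercepts.
    """
    rng = random.Random(seed)
    person_effects = [rng.gauss(0.0, sd_person) for _ in range(n_persons)]
    word_effects = [rng.gauss(0.0, sd_word) for _ in range(n_words)]
    rows = []
    for j, r_person in enumerate(person_effects):
        for i, r_word in enumerate(word_effects):
            logit = gamma00 + r_person + r_word
            prob = 1.0 / (1.0 + math.exp(-logit))  # inverse logit
            correct = 1 if rng.random() < prob else 0
            rows.append((j, i, prob, correct))
    return rows


# Hypothetical values, chosen only for illustration.
data = simulate_responses(n_persons=5, n_words=4, gamma00=0.5,
                          sd_person=1.0, sd_word=0.8)
print(len(data))  # one row per person-by-word pair
```

Because persons and words are fully crossed rather than nested, neither random effect can be absorbed into the other, which is why the model carries two separate variance components.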


Table B2

Research Questions and Associated Covariates

Research question: What is the role of child-level skills and word-level characteristics as predictors of item-level variance on irregular word reading accuracy, with a particular interest in the role of child- and word-level lexical influence?
Model no.: 2
Covariates:
Level 2 (person): ERD, TD, PA, OC, RAN, VOC, BTQ, NWD
Level 2 (word): FREQ, SPR, STR, IMAG, LENGTH, ON

Research question: Do differences exist in irregular word reading between classes of children identified as late-emerging RD and those identified as early-emerging RD and TD?
Model no.: 1
Covariates:
Level 2 (person): ERD, TD

Research question: What is the importance of child-level by word-level interactions in explaining irregular word reading variance, with a specific focus on interactions between RD status (i.e., late-emerging RD, early-emerging RD, and TD) and word-level characteristics (i.e., frequency, imageability, spelling pronunciation transparency rating, and whether an irregular word is strange)?
Model no.: 3
Covariates:
Level 1 (Person × Word interaction): ERD*FREQ, TD*FREQ, ERD*IMAG, TD*IMAG, ERD*SPR, TD*SPR, ERD*STR, LERD*STR
Level 2 (person): ERD, TD, PA, OC, RAN, VOC, BTQ, NWD
Level 2 (word): FREQ, SPR, STR, IMAG, LENGTH, ON

Note. ERD = early emerging reading difficulty; LERD = late emerging reading difficulty; RD = reading difficulty; PA = phonological awareness; OC =
orthographic choice; RAN = rapid automatized naming; VOC = vocabulary; BTQ = Book Title Questionnaire; NWD = nonword decoding; FREQ =
frequency; SPR = spelling to pronunciation rating; STR = strange vs. exception; IMAG = imageability; ON = orthographic neighborhood size;
ERD*FREQ = Early emerging reading difficulty × Frequency; LERD*FREQ = Late emerging reading difficulty × Frequency; ERD*IMAG = Early
emerging reading difficulty × Imageability; LERD*IMAG = Late emerging reading difficulty × Imageability; ERD*SPR = Early emerging reading
difficulty × Spelling to pronunciation rating; LERD*SPR = Late emerging reading difficulty × Spelling to pronunciation rating; ERD*STR = Early
emerging reading difficulty × Strange vs. exception; LERD*STR = Late emerging reading difficulty × Strange vs. exception.

Many recent studies have aimed to demonstrate that specific types of reading comprehension depend on
different  underlying  cognitive  abilities.  In  these  studies,  it  is  often  implicitly  assumed  that  reading 
comprehension  is  a  multidimensional  construct.  The  general  aim  of  this  study  was  to  examine  the 
dimensionality  of  a  large  pool  of  reading  comprehension  items  differing  according  to  text  and  question 
type.  The  items  were  administered  to  996  fourth-grade  children.  We  used  multitrait,  multimethod 
modeling  to  test  for  the  existence  of  specific  text  and  question  types.  In  addition,  the  correlations  of  factor 
scores,  reflecting  the  different  measures  of  reading  comprehension,  with  word  reading  speed,  vocabulary, 
and  working  memory  were  examined.  Confirmatory  factor  analyses  revealed  that  the  specific  measures 
of  comprehension,  differing  according  to  text  and  question  type,  hardly  reflected  systematic  variation, 
after  a  general  factor  of  reading  comprehension  was  taken  into  account.  Reading  comprehension  items 
thus  largely  reflect  a  common  factor.  Factor  scores  that  were  supposed  to  reflect  specific  comprehension 
factors  were  not  reliable  and  were  hardly  related  to  word  reading  speed,  vocabulary,  and  working 
memory. 

Keywords:  reading  comprehension,  dimensionality,  cognitive  predictors 


A child’s level of reading comprehension can be measured with
a variety of comprehension tests that differ according to text and
question characteristics. A few very early studies suggest that such
differences do not represent different aspects of comprehension, as
the structure of reading comprehension appeared to be merely
one-dimensional (Davis, 1944; Spearritt, 1972; Thorndike, 1973).
More recently, however, numerous studies have shown that
comprehension measures can differ in the comprehension abilities that
are assessed and in the cognitive skills that are required (e.g.,
Andreassen & Braten, 2010; Bowyer-Crane & Snowling, 2005;
Cutting & Scarborough, 2006; Francis et al., 2006; Keenan,
Betjemann, & Olson, 2008; Keenan & Meenan, 2014; Kendeou,
Papadopoulos, & Spanoudis, 2012; Nation & Snowling, 1997;
Spear-Swerling, 2004). In contrast to the early studies, these more recent
studies assume that different measures of reading comprehension
reflect different subskills. Put differently, they find that the structure of
reading comprehension is multidimensional. The main purpose of the
current study was to further examine the dimensionality of reading
comprehension. The first question was whether it is possible to
distinguish specific types of reading comprehension measures in a large item
pool. A second question was whether the relations of word reading
speed, vocabulary, and working memory with reading comprehension
depend on the type of reading comprehension measure. In
what follows, we first discuss previous research on differences between
specific reading comprehension measures. We then argue why it is
necessary to examine the structure of reading comprehension
before investigating the relations between cognitive
predictors and specific measures of reading comprehension.

Reading  Comprehension  Measures  Differ  in  Their 
Cognitive  Predictors 

Reading  comprehension  depends  on  several  cognitive  processes 
(Kendeou,  van  den  Broek,  Helder,  &  Karlsson,  2014).  According 
to  the  Simple  View  of  Reading  (Hoover  &  Gough,  1990),  the 
readers’  accuracy  in  decoding  words  and  linguistic  processes,  such 
as  vocabulary,  are  major  determinants  of  reading  comprehension 
(de  Jong  &  van  der  Leij,  2002;  Hoover  &  Gough,  1990;  Tilstra, 
McMaster,  van  den  Broek,  Kendeou,  &  Rapp,  2009;  Verhoeven  & 
van  Leeuwe,  2008).  In  more  transparent  languages,  however, 
measures  of  reading  accuracy  do  not  discriminate  between  poor 
and  good  readers  (Florit  &  Cain,  2011).  In  such  languages,  a  speed 
component  is  necessary  for  decoding  to  be  related  to  reading 
comprehension (de Jong & van der Leij, 2002). Beyond decoding
and vocabulary, working memory is generally considered an
important contributor to reading comprehension, as comprehending a
text involves the construction of relations between words and
sentences and, in the end, the construction of a situation model
(e.g., Cain, Oakhill, & Bryant, 2004; Daneman & Merikle, 1996).
In the current study, we focused on three cognitive predictors of
reading comprehension: word reading speed, receptive vocabulary,
and working memory.

The involvement of cognitive abilities in reading comprehension
tends to vary across reading comprehension tests differing in
text and question types. For example, it has been suggested that
long texts are more dependent on working memory than short texts
(Andreassen & Braten, 2010). This is in line with the construction of
a situation model: compared with shorter texts, the comprehension
of long texts demands more frequent updates of the situation
model, and more information has to be stored in memory (e.g.,
Kintsch, 2012). However, the results on the relations between
cognitive processes and particular comprehension
tests are equivocal (Andreassen & Braten, 2010; Basaraba,
Yovanoff, Alonzo, & Tindal, 2013; Bowyer-Crane & Snowling,
2005; Cutting & Scarborough, 2006; Eason, Goldberg, Young,
Geist, & Cutting, 2012; Francis et al., 2006; Keenan et al., 2008;
Keenan & Meenan, 2014; Kendeou et al., 2012; Miller et al., 2014;
Nation & Snowling, 1997). For instance, in contrast to the finding
by Andreassen and Braten (2010) that long texts place a larger
demand on working memory than short texts, Keenan and Meenan
(2014) found the opposite. Keenan and Meenan take this
difference to stem from the format features of the tests; tests with
short texts require children to hold more (detailed) information in
memory than tests with long texts. Another example of mixed
findings on the contribution of cognitive abilities to reading
comprehension relates to reading fluency and text genre: Garcia and
Cain (2014) argued that reading fluency was more important for
narrative texts than for expository texts, whereas the study of
Eason et al. (2012) revealed a comparable contribution of fluency
to both genres of texts. Although research shows that the
contribution of cognitive skills to reading comprehension depends on the
reading comprehension test used, it remains unclear which specific
text and question types are responsible for the differences in the
contributions of the various cognitive skills.

A methodological explanation for the inconsistent findings on
the relations between cognitive abilities and specific comprehension
tests might be that differences in relations were not always
tested for significance (with the exception of Kendeou et al.,
2012). For example, regression analyses by Eason et al. (2012)
showed that understanding of inferential language had a specific
effect on interpretation questions and questions requiring critical
analyses and process strategies, but not on initial understanding
questions. The differences among the various standardized regression
estimates on inferential language were small. Specifically,
the nonsignificant estimate for initial understanding questions was
.11, whereas the significant estimates for the other types of
comprehension (interpretation and critical analyses/process strategies
questions) were .17 and .19, respectively (Table 4 in Eason et al.,
2012). These standardized regression estimates might not turn out
to be significantly different if tested for significance. More
generally, differences in the relationships of cognitive abilities with
types of reading comprehension tests might be overestimated if not
tested and, in combination with small samples, might prove
difficult to replicate.

The inconsistent findings of the relations between cognitive
abilities and comprehension measures might also be due to the use
of different intact reading comprehension tests (whole tests for
reading comprehension with specific text and question
characteristics). Such tests could differ in many respects, complicating the
interpretation of the findings. For example, Keenan et al. (2008)
found differences in the contribution of reading accuracy between
comprehension tests with cloze items (gap filling items) and
question-and-answer items. However, because these comprehension
tests also differed in passage length, an alternative explanation
for these differences might be that reading accuracy is more
important for short than for long texts. In a few studies, testing of
differences among specific comprehension measures was reported.
These studies grouped the questions of one reading comprehension
test according to text and question characteristics and computed
subtest scores (groups of items with a specific text or question
characteristic; Basaraba et al., 2013; Eason et al., 2012; Miller et
al., 2014). Although the use of subtest scores can be regarded as an
improvement over the comparison of intact measures, matching of
subtests on (influential) characteristics that are not of interest
might be difficult and only partially successful.

Another problem with the studies that focused on differences
between reading comprehension tests is that they implicitly
assume that specific reading comprehension measures truly exist.
However, these studies did not examine the dimensionality of
reading comprehension (Andreassen & Braten, 2010;
Bowyer-Crane & Snowling, 2005; Cutting & Scarborough, 2006; Eason et
al., 2012; Francis et al., 2006; Keenan et al., 2008; Keenan &
Meenan, 2014; Kendeou et al., 2012; Miller et al., 2014; Nation &
Snowling, 1997). Therefore, in the current study the dimensionality
of reading comprehension was tested with a set of analyses at
the item level. In this way, an unequal distribution of items over text
and question types can be taken into account. Moreover, the
analyses provide a more direct test of the existence of specific
measures of reading comprehension, an existence that is merely
assumed when subtests are formed a priori.

Examining  the  Structure  of  Reading  Comprehension 

A few early studies that examined the structure of reading
comprehension revealed that reading comprehension items mainly
reflect a single reading comprehension factor (Davis, 1944;
Spearritt, 1972; Thorndike, 1973). Although in these studies
exploratory factor analyses of the reading comprehension questions
suggested that several factors could be distinguished, reliability
analyses revealed that only one of these factors could be
considered reliable. Moreover, the correlations among the different
factors were very high. A more recent study tested the existence of
literal, inferential, and evaluative factors in a pool of 20 questions
originating from one text, while taking into account a general
reading comprehension factor (Basaraba et al., 2013). The results
of this study showed that, in addition to a general factor, specific
reading comprehension factors could also be distinguished, thereby
providing evidence that reading comprehension is a
multidimensional construct. In the current study, more complex structures of
items were tested than in those previous studies, because we used
77 items originating from several different texts. When examining
the structure of such a large pool of reading comprehension items,
the fact that items are nested within several texts should be taken
into account. In addition, the possibility that all reading
comprehension items are indicators of the same construct should be
controlled for with a general reading comprehension factor.

A rigorous way to test the structure of a large pool of reading
comprehension items is to use a multitrait, multimethod model
(MTMM model; Eid et al., 2008; Maul, 2013). Such a model can
be used to separate trait, method, and error components. A trait
refers to the construct that is intended to be measured, usually by
two or more different tests (Little, 2013). The method components
represent  variance  that  measures  have  in  common  because  they 
entail  the  same  method  of  measurement.  Method  variance  is  often 
regarded  as  nuisance  variance  because  it  is  not  of  principled 
interest  (Maul,  2013).  In  the  current  study,  we  consider  the  texts  in 
which  the  questions  are  nested  as  method  factors  because  the 
measurement  of  reading  comprehension  should  not  depend  on  the 
use  of  particular  texts.  In  contrast,  the  text  and  question  types  are 
regarded  as  (specific)  trait  factors  (see  Figure  1,  Model  a,  for  a 
simplified  illustration  of  the  MTMM  model). 

The text and question types, or traits, in a MTMM model (see
Figure 1, Model a) can be correlated due to relations to a common
higher order factor. In a hierarchical factor model, or more
specifically, a second-order factor model, the second-order factor
represents the relations between the correlated traits (e.g.,
Gustafsson, 1984, 2002). Such a second-order factor model with separate
method factors (see Figure 1, Model b) is more restrictive than a
MTMM model and, if the model fits, is to be preferred over a model
with correlated first-order trait factors and method factors only (as
in Model 1a; Anthony et al., 2011). A specific form of a second-order
factor model is the bifactor model, or nested factor model (e.g.,
Chen, West, & Sousa, 2006; Gustafsson & Aberg-Bengtsson,
2010; Schmid & Leiman, 1957). This model is especially suited to
separate general and specific factors, and its interpretation is
straightforward. In bifactor models, a general factor represents the
variance that all items (or tests) have in common (Chen et al.,
2006; Schmid & Leiman, 1957). The uncorrelated specific factors
describe the variance that items have in common after the common
variance described by the general factor is taken into account. In
item bifactor models, each item thus has a loading on the general
factor and a second loading on one of the specific factors (Cai,
Yang, & Hansen, 2011; Gibbons et al., 2007; Gibbons & Hedeker,
1992; Undheim & Gustafsson, 1987; see Figure 1, Model c). An
important advantage of a bifactor model over a second-order factor
model is that the variance of a set of items can be decomposed into
the variance that is explained by a general reading comprehension
factor and the variance that is explained by the specific factors
(Gustafsson & Aberg-Bengtsson, 2010).
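
The variance decomposition that a bifactor model allows can be illustrated for a single item. The sketch below assumes a standardized item (or, for binary items, a standardized latent response variable) and uncorrelated factors, so that squared standardized loadings can be read as proportions of variance. The loadings are hypothetical and the function name is illustrative; neither comes from the study.

```python
def decompose_item_variance(general_loading, specific_loading):
    """Split a standardized item's variance into three parts.

    Assumes orthogonal general and specific factors and unit item
    variance, so each squared loading is a variance proportion.
    """
    general = general_loading ** 2
    specific = specific_loading ** 2
    residual = 1.0 - general - specific
    if residual < 0:
        raise ValueError("squared loadings exceed the item's unit variance")
    return {"general": general, "specific": specific, "residual": residual}


# Hypothetical loadings, chosen only for illustration.
parts = decompose_item_variance(general_loading=0.6, specific_loading=0.3)
for name, share in parts.items():
    print(f"{name}: {share:.2f}")
```

When the specific share is small relative to the general share, the specific factor adds little systematic variance beyond the general factor, which is the pattern the dimensionality analyses are designed to detect.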

Second-order and bifactor models have been regularly used to
examine the presumed hierarchical structure of intelligence (e.g.,
Carroll, 2003; Gustafsson, 1984, 2002; Undheim & Gustafsson,
1987). More recently, these models have been applied to
describe other cognitive domains, such as phonological
awareness, oral language, and literacy (Anthony et al., 2011;
Foorman, Koon, Petscher, Mitchell, & Truckenmiller, 2015;
Mehta, Foorman, Branum-Martin, & Patrick Taylor, 2005;
Papadopoulos, Kendeou, & Spanoudis, 2012). To the best of our
knowledge, the study by Basaraba et al. (2013) is the only one that
examined the structure of reading comprehension with such
complex models. As noted, in the study by Basaraba et al. (2013) the
structure of a relatively small number of items (20) originating
from only one text was examined. A bifactor model was fitted in
which each item loaded on a general factor and on one of the
specific factors that represented literal, inferential, and evaluative
questions. In the current study, items came from different texts and
differed with respect to text type and question type. To examine
the structure of this pool of items, we fitted a model with one
general factor (as in a bifactor model) and various specific trait
and method factors (as in a MTMM model; see Model c in Figure
1). In this complex MTMM model, the indicators are the reading
comprehension items, and the method factors
are represented by the different texts. The specific trait factors are the
text and question types, and the general trait is a general reading
comprehension factor. As in a bifactor model, all latent factors are
specified to be uncorrelated. As a result, each item can be
described by its relation with the general factor, with several text and
question type factors, and with one of the text factors.
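
The loading pattern described above can be written down as a simple lookup: given an item's text and question characteristics, list the latent factors on which it is allowed to load. The sketch below is only an illustration of that pattern. The class, field names, and factor labels are assumptions, and it assumes that a specific factor is specified for every text and question type, which the fitted models need not do.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    text_id: int  # the text the question belongs to (method factor)
    genre: str    # "narrative" or "expository"
    length: str   # "short" or "long"
    level: str    # "literal", "inferential", or "evaluative"
    fmt: str      # "four-option", "open-ended", or "true-false"


def factors_for(item: Item) -> list[str]:
    """Return the latent factors an item loads on in the nested MTMM layout."""
    return [
        "general",                 # general reading comprehension factor
        f"genre:{item.genre}",     # specific trait: text genre
        f"length:{item.length}",   # specific trait: text length
        f"level:{item.level}",     # specific trait: level of comprehension
        f"format:{item.fmt}",      # specific trait: question format
        f"text:{item.text_id}",    # method factor: the text itself
    ]


# A hypothetical item, not taken from the study's item pool.
example = Item(text_id=1, genre="narrative", length="long",
               level="inferential", fmt="open-ended")
print(factors_for(example))
```

Because all factors are specified as uncorrelated, each entry in this list corresponds to a separate, non-overlapping share of the item's variance.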

Aims  of  the  Current  Study 

The general aim of the current study was to examine the
dimensionality of reading comprehension. This was tested with several
confirmatory factor models (i.e., a one-factor model, a bifactor
model, and several MTMM models). In line with the study of
Basaraba et al. (2013), we hypothesized that specific text and
question dimensions of reading comprehension could be
distinguished. After specific dimensions of reading comprehension were
determined, we examined the relations among word reading speed,
receptive vocabulary, and working memory, and the various
specific comprehension measures (i.e., text and question types).
Although situation model theory would expect long texts to depend
more on working memory than short texts, findings so far have not
been consistent, and hypotheses of these studies mainly concern
differences between reading comprehension tests instead of
differences between specific text and question types (Andreassen &
Braten, 2010; Basaraba et al., 2013; Bowyer-Crane & Snowling,
2005; Cutting & Scarborough, 2006; Eason et al., 2012; Francis et
al., 2006; Keenan et al., 2008; Keenan & Meenan, 2014; Kendeou
et al., 2012; Miller et al., 2014; Nation & Snowling, 1997).
Therefore, although we expected that the relations of cognitive
abilities with reading comprehension are dependent on the text and
question types, previous results give little guidance for specific
predictions. The current study might give more information about
how text and question types determine the relation of cognitive
abilities with reading comprehension.

Method 

Participants 

Participants were 996 fourth-grade children from 43 classes
in 35 Dutch elementary schools. These children participated in a
longitudinal intervention study, but for the current study we used
only the pretest data. The sample of schools
was heterogeneous with respect to location, percentage of
immigrants, and the average level of education of the parents,
representing the differences between schools in the Netherlands.
The sample consisted of 506 boys and 490 girls with a mean age
of 9 years and 7 months (SD = 5.69 months). Almost 5% of the
entire sample was born outside the Netherlands, and for 10% of
the entire sample, both parents were born outside the Netherlands.

DIMENSIONS OF READING COMPREHENSION

Figure 1. Examples of multitrait, multimethod models for reading comprehension. (a) A model with two
correlated trait factors and unrelated method factors. (b) A second-order factor model for the traits and unrelated
method factors. (c) A nested factor model with one general trait factor and unrelated method and specific trait
factors. Diagram labels: Traits, Indicators, Methods.

In the Netherlands, children are in elementary school from the age
of 4 until the age of 12 years (2 years of kindergarten and Grade 1
through Grade 6). Literacy instruction starts in first grade. Most fourth
graders are able to read fluently. Education in reading comprehension
usually starts at the end of Grade 2 or at the beginning of Grade 3.
Teachers are required to spend 1 to 2 hr per week on teaching reading
comprehension. During the comprehension lessons, children are
taught to pay attention to important characteristics of a text, such as
the title and headings, and also to connectives and linking words. In
addition, they learn strategies for how to deal with different texts, such
as predicting, questioning, and summarizing.

Design 

For the measurement of reading comprehension, a total of 77
reading comprehension questions were used. The questions
originated from three different reading comprehension tests (further
described in the Instruments section) encompassing nine different
texts. The texts differed in text and question types. Text types
concerned text genre (narrative or expository) and text length
(short or long). Question types were level of comprehension
(literal, inferential, and evaluative) and question format (four-option,
open-ended, and true-false). Narrative texts had characters and a
plot, consisted of everyday vocabulary, followed a timeline, were
written in past tense, and were often fictional (Best, Floyd, &
McNamara, 2008; Eason et al., 2012). Expository texts provided
information about a specific topic, included technical vocabulary,
and were not structured in a temporal sequence. Literal questions
examined children’s understanding of information stated explicitly
in the text (Basaraba et al., 2013; Eason et al., 2012; Miller et al.,
2014). Inferential questions assessed children’s ability to make
inferences and to draw conclusions about information that was
stated in the text. Evaluative questions required an integration
of information stated in the text with background knowledge,
or required the use of reading strategies; children had to evaluate
the information acquired from the text.

The coding of all items with respect to the text and question
types was done by the first and second authors of this study based
on the guidelines provided above. The raters coded the texts and
questions independently. There were no differences between the
judgments of the text genres. Texts from the Aarnoutse and
Kapinga (AK)-Reading Comprehension test with a length of 122 to
288 words were coded as short texts, and the texts from the
Progress in International Reading Literacy Study (PIRLS) tests
with 832 and 920 words were coded as long texts. With respect to
level of comprehension (i.e., literal, inferential, and evaluative
questions), 73% of the questions were coded identically by both
raters. This corresponded with a Cohen’s kappa of .58, which can be
classified as moderate interrater reliability (Viera & Garrett, 2005). The
two raters reached consensus about the classification of the other
27% of the questions through discussion. For question format,
four-option, true-false, and open-ended questions were
distinguished. The distribution of the questions over the text and
question types is presented in Table 1.
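
Cohen's kappa corrects the observed proportion of agreement, p_o, for the agreement expected by chance, p_e: kappa = (p_o - p_e) / (1 - p_e). Taken at face value, the reported p_o = .73 and kappa = .58 imply a chance agreement of roughly (.73 - .58) / (1 - .58) ≈ .36, subject to rounding. The sketch below shows the computation on a small set of hypothetical codes. The labels are invented for illustration and are not the study's ratings.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters who each assign one category per item."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)
    # Observed agreement: share of items given the same code by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the raters' marginal proportions per code.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1.0 - expected)


# Hypothetical codes for six questions, not the study's data.
codes_a = ["lit", "lit", "inf", "inf", "eval", "lit"]
codes_b = ["lit", "inf", "inf", "inf", "eval", "lit"]
print(round(cohens_kappa(codes_a, codes_b), 3))  # 0.739
```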

Instruments 

Reading comprehension. Reading comprehension was assessed
through two different reading comprehension tests. The
PIRLS (Mullis, Martin, Gonzalez, & Kennedy, 2003) tests were
several reading comprehension tests that each contained a narrative or
expository text followed by a number of questions. In the current
study, one test with a narrative text (“Enemy Pie”) and one test
with an expository text (“The Mystery of the Giant Tooth”) were
used. Within each test, four different levels of comprehension were
assessed, testing children’s ability to (a) focus on and
retrieve explicitly stated information, (b) make straightforward
inferences, (c) interpret and integrate ideas and information, and
(d) examine and evaluate information in the text. Each test
contained two different question formats: multiple-choice and
open-ended questions. The multiple-choice questions consisted of four
options from which children had to select the correct one. For the
open-ended questions, children were asked to write down their
answer. Children’s answers on the open-ended questions were
scored by trained test assistants based on standardized scoring
guidelines. Each correct multiple-choice question was awarded 1
point; each correct open-ended question was awarded 2 points (or
1 point if partly correct). The text “Enemy Pie” comprised 832 words
and was followed by 16 questions. The text “The Mystery of the
Giant Tooth” contained 920 words and was followed by 17 questions.
Before the start of the test, children were shown
examples of how to answer the different question formats. After
that, children were asked to read the texts silently and to complete
all questions. The texts were available throughout the entire
assessment. All children received enough time to finish the test; each
text took approximately 40 min to complete. Cronbach’s alphas for
the PIRLS test with a narrative and an expository text were .77 and
.76, respectively.

Table 1

Distribution of Questions per Text and Question Type

| Text/Question type | 1a | 1b | 2a | 2b | 3a | 3b | 3c | 4a | 4b | 4c | Total |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1a. Narrative texts | - | - | | | | | | | | | 34 |
| 1b. Expository texts | - | - | | | | | | | | | 43 |
| 2a. Short texts | 18 | 26 | - | - | | | | | | | 44 |
| 2b. Long texts | 16 | 17 | - | - | | | | | | | 33 |
| 3a. Literal questions | 8 | 25 | 19 | 14 | - | - | - | | | | 33 |
| 3b. Inferential questions | 17 | 12 | 14 | 15 | - | - | - | | | | 29 |
| 3c. Evaluative questions | 9 | 6 | 11 | 4 | - | - | - | | | | 15 |
| 4a. Four-option questions | 16 | 21 | 22 | 15 | 13 | 15 | 9 | - | - | - | 37 |
| 4b. Open-ended questions | 9 | 9 | 0 | 18 | 7 | 8 | 3 | - | - | - | 18 |
| 4c. True-false questions | 9 | 13 | 22 | 0 | 13 | 6 | 3 | - | - | - | 22 |

The AK-Reading Comprehension test (Aarnoutse & Kapinga,
2006) is part of a standardized battery of tests to measure reading
comprehension from Grades 1 through 6. In this study, the test for
Grades 4, 5, and 6 was used. The test consisted of a booklet with
seven short texts (122 to 288 words) and 44 multiple-choice
questions, covering both narrative and expository texts. The
multiple-choice questions had either four (A, B, C, D) or two
(true-false) options. Each text was followed by six or seven
questions: three or four four-option questions and three or four
true-false questions. Before the test, one example text was given as a
practice trial. Children were required to read all texts silently and
complete all questions. All texts remained available
during the test. Test administration took approximately 50 min.
Cronbach’s alpha was .79.

The decoding and comprehension levels of the reading comprehension
texts were calculated with the Programma voor berekening
Cito LeesIndex voor het Basisonderwijs [Program for calculating
the Cito Reading Index for primary education] (P-CLIB
program; Evers, 2008). This program maps the levels of text
decoding and comprehension onto the grade levels in which texts of
these levels are generally presented. The decoding level of texts was
determined from the average word length and the proportion of
high-frequency words. The comprehension level was based on the
average word length, average sentence length, variation in words,
and proportion of high-frequency words. The decoding levels of the
nine different texts ranged from the middle of Grade 3 to the middle
of Grade 6. The comprehension levels of the texts were between the end
of fourth grade and the end of sixth grade.

Word  reading  speed.  For  word  reading  speed  we  used  the 
Een-minuut-test  (Brus  &  Voeten,  1979).  This  is  a  standardized 
Dutch  test  often  used  to  measure  word  reading  speed.  Children 
were  presented  with  a  list  of  words  of  increasing  difficulty  and 
asked  to  accurately  read  aloud  as  many  words  as  possible  within  1 
min. The list consisted of 116 words that increased in length from
one  to  five  syllables.  The  score  was  the  number  of  words  read 
correctly  within  1  min.  Reliability  scores  could  not  be  computed 
with  the  data  of  the  present  study.  The  mean  parallel-test  reliability 
is  .90  (van  den  Bos,  Lutje  Spelberg,  Scheepstra,  &  de  Vries,  1994). 

Receptive vocabulary. An adapted form of the Dutch version
(Schlichting, 2005) of the Peabody Picture Vocabulary Test (Dunn &
Dunn, 1997) was used to measure receptive vocabulary. In the present
study, Sets 8 to 13 were used, which consisted of 72 items in total.
Each item consisted of four pictures. For practical reasons, the test
was administered in a classroom setting instead of individually and
took approximately 30 min. Children received a booklet with the
items and were instructed to underline the one picture out of four
alternatives that corresponded to the word said by the test assistant.
Before the start of the test, two practice items were given. All children
finished the entire test. The total score was the number of correct
answers. Within our sample, Cronbach’s alpha was .69.

Verbal working memory. An experimental listening span test
was chosen to measure verbal working memory. For each item,
children were required to listen to a series of sentences. The sentences
consisted of three to seven words and were presented by a test
assistant. After each sentence was presented, the children had to decide
whether the sentence was correct and remember the last word of the
sentence. The words that had to be remembered were monosyllabic and
commonly known by 6-year-old children (Schaerlaekens, Kohnstamm,
& Lejaegere, 1999). At the end of each series of sentences, the last words
had to be recalled in the same order as the sentences were presented. The
test items increased in length from two to five sentences. There
were 20 items, four for each number of sentences. The test was
administered individually and ended when children failed on all
four items of the same number of sentences. Before the start of the
test, two example items of one and two sentences, respectively,
were given. The score was the number of items (sentence series)
recalled in the correct order. Because this was an experimental test
that was stopped when children made too many errors, items that
followed the stopping point were not administered and had missing
scores. Reliability could be calculated only if these missing items
were coded as incorrect. With a stopping rule, the difficulty of
items is presumed to increase, so it can be assumed that children
would have answered the unadministered items incorrectly. Based on
these assumptions, Cronbach’s alpha
was .69.
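
The reliability calculation described above can be illustrated with a minimal sketch. It assumes that unadministered items are stored as `None` and that Cronbach's alpha is computed with the standard formula; the toy data and the helper names `code_unadministered_as_incorrect` and `cronbach_alpha` are hypothetical and are not taken from the study.

```python
from statistics import pvariance


def code_unadministered_as_incorrect(row):
    """Replace items skipped because of the stopping rule (None) with 0."""
    return [0 if score is None else score for score in row]


def cronbach_alpha(scores):
    """Cronbach's alpha for a complete children-by-items score matrix."""
    k = len(scores[0])  # number of items
    item_variances = [pvariance([row[i] for row in scores]) for i in range(k)]
    total_variance = pvariance([sum(row) for row in scores])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


# Hypothetical data: four children, four items of increasing difficulty.
raw = [
    [1, 1, 1, 0],
    [1, 1, 0, None],
    [1, 0, None, None],
    [0, None, None, None],
]
coded = [code_unadministered_as_incorrect(row) for row in raw]
alpha = cronbach_alpha(coded)
print(round(alpha, 2))
```

Coding skipped items as 0 makes the matrix complete, which the alpha formula requires; whether that coding is defensible depends on the assumption, stated in the text, that item difficulty increases.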

Procedure 

The  tests  were  administered  in  four  test  sessions.  In  the  first  test 
session,  both  PIRLS  tests  were  carried  out  in  a  classroom  setting. 
On  the  second  day,  the  AK-Reading  Comprehension  test  and  the 
Peabody  Picture  Vocabulary  test  were  administered,  also  in  a 
classroom  setting.  The  listening  span  and  the  word  reading  test 
were  administered  individually  during  a  third  session. 

Analyses 

Four questions of the PIRLS tests on which 2 points could be
earned were recoded as dichotomous so that the scoring of all
questions was comparable. For one easy question (for which 47% of the
children had 2 points), both 0 and 1 point were scored as 0, and 2
points were scored as 1 point. For the three more difficult questions
(for which 54% to 77% had 0 points), both 1 and 2 points were scored
as 1 point.
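
The two recoding rules can be written out as a short sketch. The function names are hypothetical; only the mapping of points follows the description above.

```python
def recode_easy_question(points):
    """Easy 2-point question: only full credit (2 points) counts as correct."""
    return 1 if points == 2 else 0


def recode_difficult_question(points):
    """Difficult 2-point question: partial or full credit counts as correct."""
    return 1 if points >= 1 else 0


easy = [recode_easy_question(p) for p in (0, 1, 2)]
difficult = [recode_difficult_question(p) for p in (0, 1, 2)]
print(easy, difficult)
```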

The structure of the reading comprehension items was examined
with several confirmatory factor models. First, a one-factor model
was estimated in which all items load on a single reading
comprehension factor. Second, because each item also pertains to one
of the nine texts, a bifactor model was specified by adding nine text
factors (one for the items of each text). Third, different complex
MTMM models were estimated. These models are more complex
than a standard MTMM model because all items load on a general
trait factor, a method factor, and several specific trait factors (see
Figure 1, Model c). The factors for text genre, text length, level of
comprehension, and question format were each added separately to the
model with the general factor and the text factors. In the fourth,
fifth, and sixth steps, the text and question type factors were added
one by one. In the final model, items loaded on a general reading
comprehension factor, on one of the nine text factors, and on all text
and question type factors (see Figure 2). In this complex MTMM
model, the indicators are the reading comprehension items, and the
method factors are represented by the different texts. Specific trait
factors are the text and question types, and the general trait is a
general reading comprehension factor. As in a bifactor model, all
latent factors are specified to be uncorrelated. As a result, each
item can be described by its relation with the general factor, with
several text and question type factors, and with one of the text
factors. An alternative model, without the nine text factors, was
estimated as well.
Figure 2. The final complex multitrait, multimethod model. In this model, all items have a loading on the
reading comprehension factor (general trait), on four of the text and question type factors (specific traits), and
on one of the text factors (methods).


The factor analyses were conducted with Mplus Version 7.11
(Muthén & Muthén, 2012). Robust weighted least squares (WLS)
estimation was used to obtain parameter estimates. The models
only contained dichotomous items and WLS was the estimator;
therefore, theta-parameterization was used. Because the children in
our sample were nested within classes, there is some dependency
in the data. The intraclass correlation coefficients were .08, .12,
and .13 for the three different reading comprehension tests. Mplus
can account for the nested structure of the data and adjust the standard
errors accordingly (by using the TYPE = COMPLEX option).

Overall model fit was evaluated with the chi-square goodness-of-fit
test statistic, the root mean square error of approximation (RMSEA),
and the comparative fit index (CFI; Kline, 2011). A significant
chi-square indicated poor model fit, and a nonsignificant
chi-square indicated good fit to the data. An RMSEA below .05 was taken
as good approximate fit, values between .05 and .08 indicated
satisfactory approximate fit, and an RMSEA over .10 was considered
poor approximate fit (Browne & Cudeck, 1993). A CFI larger than .95
indicated good incremental model fit, and a CFI larger than .90 was
considered acceptable (Hu & Bentler, 1999). To test differences in model
fit between two nested models, the chi-square difference test was used
(Kline, 2011). Because the difference between the chi-square values
of two nested models estimated with WLS does not have a chi-square
distribution, the regular chi-square difference test is not valid.
Therefore, the corrected chi-square difference test (with Satorra-Bentler
correction; DIFFTEST option in Mplus) was used in this study. In
addition to the global model fit,
the local fit of the model was investigated by inspecting the factor
loadings and calculating reliability scores for the specific factors.
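
The cutoffs above can be collected in a small sketch. The RMSEA point estimate uses the common formula sqrt(max(chi-square - df, 0) / (df * (N - 1))); this formula and the sample size N = 995 are assumptions made for illustration, and Mplus may compute the index slightly differently for robust WLS with clustered data. The function names are hypothetical.

```python
import math


def rmsea(chi2, df, n):
    """Point estimate of RMSEA from a chi-square value, its df, and sample size."""
    return math.sqrt(max(chi2 - df, 0.0) / (df * (n - 1)))


def describe_rmsea(value):
    """Apply the RMSEA cutoffs named in the text."""
    if value < 0.05:
        return "good"
    if value <= 0.08:
        return "satisfactory"
    if value > 0.10:
        return "poor"
    return "not classified in the text"  # the text is silent on .08 to .10


def describe_cfi(value):
    """Apply the CFI cutoffs named in the text."""
    if value > 0.95:
        return "good"
    if value > 0.90:
        return "acceptable"
    return "below acceptable"


# Values reported for the one-factor model (Model 1 in Table 4); N is assumed.
model_1_rmsea = rmsea(3219.67, 2849, 995)
print(round(model_1_rmsea, 3), describe_rmsea(model_1_rmsea), describe_cfi(0.92))
```

Under these assumptions the sketch reproduces the pattern reported for the one-factor model: a significant chi-square can coexist with good approximate fit when the sample is large.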

Factor scores for the specific comprehension measures were
extracted from the final model to determine whether the relations
between reading comprehension and word reading speed,
vocabulary, and working memory depend on the type of reading
comprehension measure. To this end, the factor scores of the latent
factors of the model were added to a dataset with the cognitive
predictors. The correlations among those factor scores and word
reading speed, vocabulary, and working memory were examined.

Results 

Data  Screening  and  Descriptive  Statistics 

Data were checked for outliers and missing values. Scores that
were more than three standard deviations above or below the mean
were omitted. In total, less than 1% of the scores were missing. In
most cases, missing scores were caused by illness of the child.
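
The screening rule can be sketched as follows. The sketch assumes that the sample standard deviation is used and that omitted scores are set to missing (`None`); neither detail is stated in the text, and the function name and data are hypothetical.

```python
from statistics import mean, stdev


def screen_outliers(scores, limit=3.0):
    """Set scores more than `limit` SDs from the mean to None (missing)."""
    m = mean(scores)
    sd = stdev(scores)
    return [x if abs(x - m) <= limit * sd else None for x in scores]


# Hypothetical scores: 19 typical values and one extreme value.
scores = [10] * 19 + [100]
screened = screen_outliers(scores)
print(screened.count(None))
```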

The  maximum,  mean,  and  standard  deviation  of  the  measures 
for  reading  comprehension,  word  reading  speed,  vocabulary,  and 
working  memory  are  displayed  in  Table  2.  All  variables  were 
normally distributed, with skewness values ranging from -.56 to
.23 and kurtosis values between -.68 and .62 (Kline, 2011).
The  correlations  of  the  reading  comprehension  tests  with  word 
reading  speed,  vocabulary,  and  working  memory  were  moderate 
(see  Table  3).  According  to  the  grade-referenced  norms  of  the 
AK-Reading  Comprehension  test  and  the  word  reading  test,  the 
children  in  our  sample  had  average  levels  of  word  reading  and 
reading  comprehension.  Therefore,  the  level  of  word  reading  and 
the  comprehension  levels  of  the  texts  were  adequately  matched  to 
the  ability  level  of  the  children.  This  match  is  also  visible  in  the 
fact  that  floor  and  ceiling  effects  were  not  found  on  any  of  the 
tests. 

Testing  the  Structure  of  Reading  Comprehension 

We examined the structure of reading comprehension with a
series of confirmatory factor models. As a first step, a model with
one general factor was estimated (Model 1 in Table 4). The
chi-square value was significant, which indicates poor overall
model fit. However, the approximate fit (RMSEA) of this model to
the data was good, and the incremental fit (CFI) was acceptable.

Next, models with method factors were specified. In Model 2, a
bifactor model was estimated in which a general reading
comprehension factor and nine text factors were presumed, one for each
text in which the questions were nested. This model could not
initially be estimated: the factor loadings of three items on the
corresponding text factor appeared to be extremely high. Fixing these
factor loadings to .90 and the residual variances of these items to .19
solved the estimation problems. The overall fit, the
approximate fit, and the incremental fit of this model were all good. In
addition, the fit of this bifactor model was significantly better than
the fit of the model with a general factor only (chi-square
difference test for Model 1 vs. Model 2, see Table 4).

Third, we estimated models in which both method factors and
specific reading comprehension factors were included. In these
complex MTMM models we specified a general reading
comprehension factor, the texts as method factors, and one text or
question type as specific trait factors (see Model c in Figure 1,
and Models 3a to 3d in Table 4). The overall, approximate, and
incremental fit of these models was good, and the fit of all these
models was significantly better than the fit of the bifactor
model, Model 2 (chi-square difference test for Models 3a-d vs.
Model 2, see Table 4).

Table 2

Descriptive Statistics for Reading Comprehension, Word
Reading Speed, Vocabulary, and Working Memory Measures

Measure                               N     Maximum   M       SD
PIRLS “Enemy Pie”                     995   19        10.18   3.35
PIRLS “Mystery of the Giant Tooth”    992   18         8.31   3.52
AK-Reading Comprehension Test         986   44        26.83   6.26
Word reading speed                    991   116       61.97   13.38
Peabody Picture Vocabulary Test       988   72        35.15   6.29
Listening span                        989   16         5.42   2.36

Note. PIRLS = Progress in International Reading Literacy Study; AK =
Aarnoutse and Kapinga.

In Step 3a and the fourth, fifth, and sixth steps (Models 3a, 4, 5,
and 6 in Table 4), the factors for text genre, text length, level of
comprehension, and question format were added step by step. The
fit of all models was good and significantly better than the fit of the
previous model (chi-square difference test for Model 3a vs. 4, 4 vs.
5, and 5 vs. 6 in Table 4). The sixth model, which included all
text and question types, had the best fit. Hence, the MTMM model,
with a general reading comprehension factor, nine text factors,
four text type factors, and six question type factors, was taken as
the final model (see Figure 2 for an illustration of the final model).
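
The degrees of freedom reported in Table 4 can be checked with simple arithmetic. The interpretation offered here is an assumption rather than something stated in the text: each added set of text or question type factors gives every one of the 77 items one extra free loading, and the step from Model 1 to Model 2 frees three fewer parameters because three text-factor loadings were fixed.

```python
N_ITEMS = 77
FIXED_TEXT_LOADINGS = 3  # loadings fixed to .90 in the bifactor model

# Degrees of freedom as reported in Table 4.
df = {"1": 2849, "2": 2775, "3a": 2698, "4": 2621, "5": 2544, "6": 2467}

steps = [("1", "2"), ("2", "3a"), ("3a", "4"), ("4", "5"), ("5", "6")]
delta_df = {f"{a} vs. {b}": df[a] - df[b] for a, b in steps}

for comparison, value in delta_df.items():
    print(comparison, value)
```

Each step after Model 2 costs exactly 77 degrees of freedom, which is consistent with the reading that every item loads on exactly one factor within each added set.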

Interpretation  of  the  Factor  Models  of 
Reading  Comprehension 

The median, minimum, and maximum factor loadings per
specific latent factor are shown in Table 5. The factor loadings of the
items on the specific latent factors were often very small or, in
some instances, even negative. These relatively low factor
loadings show that the items have little in common after controlling for
the general reading comprehension factor. The variance explained
by the latent factors of the final model was calculated with the
following formula: $R^2 = (\sum_i \lambda_i^2)/77$, where 77 is the total
number of items and $\lambda_i$ is the standardized factor loading of a
particular item on a specific latent factor. The general reading
comprehension factor explained 18.70% of the variance. The additional
variance explained by the text factors, and the text and question type
factors, ranged from 0.46% to 2.25%. The low factor loadings of
the items and the little additional variance explained by these
factors imply that they are hard to interpret.

Additionally, we calculated the reliability of the factor scores that
can be derived for each factor from the final model. The reliability of
a factor score was calculated with the following formula (Brown,
1989): $\rho_c = (\sum_i \lambda_i)^2 / [(\sum_i \lambda_i)^2 + \sum_i \theta_{\varepsilon i}]$.
In this formula, $\rho_c$ represents the
reliability of the composite or latent factor, $\lambda_i$ is the standardized
factor loading of a particular item on a specific latent factor, and
$\theta_{\varepsilon i}$ is the standardized residual variance of an item.
Because the residual
variances were not provided by Mplus, these were calculated with the
following formula: $\theta_{\varepsilon i} = 1 - \lambda_i^2$. The general
factor score had a
reliability of $\rho_c$ = .94. The reliabilities of the latent text factor
scores ranged from $\rho_c$ = .00 to $\rho_c$ = .48 (median $\rho_c$ = .25;
see Table 5). The
reliabilities of the latent text and question type factor scores were
between $\rho_c$ = .00 and $\rho_c$ = .53 (median $\rho_c$ = .08). In all,
these results
showed that the reliability of the factor scores derived from the
general factor was high, whereas the reliabilities of the factor
scores from the specific latent text, text type, and question type
factors were low. The reliabilities of the text factor scores were
somewhat higher than the reliabilities of the text and question type
factor scores.
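
The two formulas above can be worked through in a short sketch. The loadings used here are hypothetical and serve only to show how the quantities behave; the function names are also hypothetical.

```python
def variance_explained(loadings, n_items_total):
    """Percentage of variance explained: sum of squared loadings over all items."""
    return 100 * sum(l * l for l in loadings) / n_items_total


def composite_reliability(loadings):
    """Composite reliability with residual variances set to 1 - loading squared."""
    squared_sum = sum(loadings) ** 2
    residual_sum = sum(1 - l * l for l in loadings)
    return squared_sum / (squared_sum + residual_sum)


# Hypothetical factor with four consistently positive loadings.
consistent = [0.5, 0.5, 0.5, 0.5]
# Hypothetical factor whose positive and negative loadings cancel out.
mixed = [0.4, -0.4, 0.3, -0.3]

print(round(variance_explained(consistent, 4), 2))
print(round(composite_reliability(consistent), 2))
print(round(composite_reliability(mixed), 2))
```

Because the numerator squares the sum of the loadings, a factor with loadings of mixed sign can explain some variance and still yield a reliability near zero, which matches the pattern of low reliabilities reported for several specific factors.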

The low median factor loadings and the subsequent limited
additional variance explained by the text and question type factors
suggest that the specific trait factors add little to the model when
controlling for a general reading comprehension factor and text
factors. However, the negative factor loadings might also be
explained by overparameterization of the model. To diminish the
chance of overfitting and to test whether the specific text and
question type factors could explain more variance, we tested an
alternative model in which the nine text factors were not included.
Thus, this model consisted of a general reading comprehension
factor and several specific trait factors (i.e., the specific text and
question types). This model had a good fit to the data, χ²(2,541) =
2,597.92, p = .211, RMSEA = .005, 90% confidence interval (CI)
[.000, .008], CFI = .99. Inspection of the model parameters
revealed that the median factor loadings of the text and question
type factors in this alternative model ranged from -.09 to .21
(median λ = .02) and the reliabilities of the text and question type
factors were between .00 and .63 (median ρ = .09). Thus,
discarding the text factors, and thereby maximizing the variance to be
explained by the specific text and question type factors, did not
make a difference. In this model, too, factor loadings on the
specific factors were low or even negative, making it difficult to
assign a particular interpretation to these factors.

Table 3

Pearson Correlations Between Reading Comprehension, Word Reading Speed, Vocabulary, and
Working Memory Measures

Measure                                  1       2       3       4       5       6
1. PIRLS “Enemy Pie”                     1
2. PIRLS “Mystery of the Giant Tooth”    .65**   1
3. AK-Reading Comprehension Test         .62**   .65**   1
4. Word reading speed                    .35**   .40**   .35**   1
5. Peabody Picture Vocabulary Test       .41**   .45**   .51**   .24**   1
6. Listening span                        .31**   .31**   .33**   .21**   .20**   1

Note. PIRLS = Progress in International Reading Literacy Study; AK = Aarnoutse and Kapinga.
** p < .01.

Correlations  of  Cognitive  Abilities  and  Specific 
Measures  of  Reading  Comprehension 

To examine whether the correlations between the cognitive
abilities and reading comprehension are dependent on the specific
measures of reading comprehension, factor scores were extracted
from the final model (Model 6 in Table 4). Obviously, correlations
between cognitive abilities and unreliable factor scores are
expected to be low. Indeed, as shown in Table 6, correlations of the
factor scores derived from the text and question type factors with
word reading speed, vocabulary, and working memory were very
low. Put differently, we hardly found any specific relations of the
cognitive abilities with the various specific measures of reading
comprehension after the general factor was taken into account. In
contrast, the correlations of the cognitive abilities with the general
reading comprehension factor were substantial. The correlations of
word reading speed and working memory with the general reading
comprehension factor were moderate, .42 and .36, respectively.
The correlation of vocabulary with reading comprehension was
high (.54).
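
Why unreliable factor scores are expected to correlate weakly with other measures can be illustrated with the attenuation formula from classical test theory, under which an observed correlation cannot exceed the square root of the product of the two reliabilities. Applying this bound to factor scores is an assumption made here for illustration; it is not an analysis reported in the study.

```python
import math


def max_observable_correlation(reliability_x, reliability_y):
    """Upper bound on an observed correlation given the two reliabilities."""
    return math.sqrt(reliability_x * reliability_y)


# Median reliability of the text and question type factor scores (.08)
# combined with the reported Cronbach's alpha of the vocabulary test (.69).
bound = max_observable_correlation(0.08, 0.69)
print(round(bound, 2))
```

With reliabilities this low, even a perfect underlying relation would surface as only a small observed correlation.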

Discussion 

The general aim of the current study was to examine the
dimensionality of reading comprehension. The results of our study revealed
that reading comprehension questions could largely be represented by
a single reading comprehension factor. Specific factors for text and
question types were not reliable and explained very little additional
variance. The results also showed that there were hardly any
differences in the relations of word reading speed, vocabulary, and working
memory with these specific factors of reading comprehension when a
general reading comprehension factor was taken into account.

Table 4

Values of Selected Fit Statistics for the Different Confirmatory Factor Models

Model  Name           χ²          df     RMSEA  90% CI        CFI   Comparison  Δχ² (a)     Δdf
1      1G             3,219.67**  2,849  .011   [.009, .013]  .92
2      1G9T           2,853.14    2,775  .005   [.000, .009]  .98   1 vs. 2     1,069.21**  74
3a     1G9T2G         2,753.69    2,698  .005   [.000, .008]  .99   2 vs. 3a    149.56**    77
3b     1G9T2L         2,759.88    2,698  .005   [.000, .008]  .99   2 vs. 3b    131.30**    77
3c     1G9T3C         2,761.19    2,698  .005   [.000, .008]  .99   2 vs. 3c    136.04**    77
3d     1G9T3F         2,755.26    2,698  .005   [.000, .008]  .99   2 vs. 3d    155.50**    77
4      1G9T2G2L       2,659.53    2,621  .004   [.000, .008]  .99   3a vs. 4    132.32**    77
5      1G9T2G2L3C     2,568.50    2,544  .003   [.000, .008]  1.00  4 vs. 5     128.63**    77
6      1G9T2G2L3C3F   2,479.75    2,467  .002   [.000, .007]  1.00  5 vs. 6     119.02**    77

Note. RMSEA = root mean square error of approximation; CI = confidence interval; CFI = comparative fit index; 1 = one general reading
comprehension factor; 2 = one general factor + nine factors for the different texts; 3a = one general factor + nine text factors + two factors for the
different text genres; 3b = one general factor + nine text factors + two factors for the different text lengths; 3c = one general factor + nine text factors +
three factors for the different levels of comprehension; 3d = one general factor + nine text factors + three factors for the different question formats; 4 =
one general factor + nine text factors + two text genre factors + two text length factors; 5 = one general factor + nine text factors + two text genre
factors + two text length factors + three level of comprehension factors; 6 = one general factor + nine text factors + two text genre factors + two text
length factors + three level of comprehension factors + three question format factors (final model).
(a) Corrected chi-square difference test (Satorra-Bentler correction).
** p < .01.

Table 5

Percent of Variance Explained by Latent Factors and Reliabilities of the Latent Factors in the Final Model

Latent factor      k    Median λ   Min λ   Max λ   R² (%)   ρ
General factor     77   .42        .06     .64     18.70    .94
Text 1             16   .08        -.38    .23     .71      .00
Text 2             17   .18        -.02    .55     1.54     .45
Text 3             6    .13        -.11    .87     1.18     .23
Text 4             6    .36        .10     .61     1.28     .48
Text 5             7    .14        -.06    .81     1.08     .25
Text 6             6    .26        .01     .82     1.19     .39
Text 7             6    -.09       -.61    .10     .59      .14
Text 8             6    .14        -.23    .38     .46      .10
Text 9             7    .22        -.07    .89     1.80     .44
Narrative texts    34   .04        -.29    .56     1.34     .05
Expository texts   43   -.02       -.31    .44     1.56     .00
Short texts        44   -.02       -.65    .37     1.71     .01
Long texts         33   .16        -.08    .49     2.25     .53
Literal            33   .10        -.13    .39     1.18     .24
Inferential        29   .04        -.24    .43     1.02     .08
Evaluative         15   -.02       -.14    .78     1.43     .13
Four-option        37   .00        -.37    .32     1.04     .01
Open-ended         18   .10        -.17    .42     .89      .17
True-false         22   .03        -.15    .82     1.21     .08

Note. k = number of items; λ = factor loading; R² = variance explained; ρ = reliability.


The structure of reading comprehension was examined with
several confirmatory factor models. These models revealed that all
text factors as well as all text and question type factors (text genre,
text length, level of comprehension, and question format) added
significantly to the model fit. The final MTMM model for the 77
comprehension questions consisted of a general reading
comprehension factor, nine text factors, and 10 specific factors for the
various text and question types. Importantly, however, these
specific factors cannot be interpreted as a reflection of specific text
and question types, which is in line with the very early studies on
the dimensions of reading comprehension (Davis, 1944; Spearritt,
1972; Thorndike, 1973). We observed that only a few items had a
substantial loading on each text factor. The loadings of all other
items that were expected to be indicative of a factor were generally
low. Clearly, the few items with a substantial loading on a specific
factor have something in common, even after the general reading
comprehension factor is controlled. This might be due to the
common text, or even passage within the text, to which they
belong. Possibly these items within a text are related because they
depend on common prior knowledge or are related to the same
particular aspect of the situation model of the text.

Table 6

Correlations Between Factor Scores for Specific Reading
Comprehension Measures With Word Reading Speed,
Vocabulary, and Working Memory

Latent factor      Word reading speed   Vocabulary   Working memory
General factor     .42**                .54**        .36**
Text 1             .01                  .03          .05
Text 2             .10**                .06          .05
Text 3             -.06                 -.06         -.03
Text 4             .07                  .11**        .11**
Text 5             -.04                 .08*         .02
Text 6             -.02                 .02          .02
Text 7             -.01                 .03          -.01
Text 8             .02                  .02          .03
Text 9             -.02                 .01          .00
Narrative texts    .02                  -.04         .01
Expository texts   -.00                 -.05         -.01
Short texts        .11**                .03          -.00
Long texts         .07*                 .06          .10**
Literal            .08**                .10**        .10**
Inferential        -.02                 .08**        .05
Evaluative         .05                  .06          .02
Four-option        -.08**               .07*         -.01
Open-ended         .06*                 .03          .04
True-false         -.07*                -.08*        -.03

* p < .05. ** p < .01.

The dimensionality of cognitive constructs has been examined
for several decades now (e.g., Anthony et al., 2011; Foorman et al.,
2015; Gustafsson, 1984, 2002; Mehta et al., 2005; Papadopoulos et
al., 2012). In contrast to the very early studies on the
dimensionality of reading comprehension (Davis, 1944; Spearritt, 1972;
Thorndike, 1973), in which exploratory factor analyses were
mainly used, we used confirmatory factor analyses, in particular
MTMM modeling. This MTMM modeling might be regarded as a
specific type of hierarchical modeling, which has also been used to
examine the structure of intelligence (e.g., Carroll, 2003;
Gustafsson, 1984, 2002; Undheim & Gustafsson, 1987). The specific type
of hierarchical modeling used in this study enabled us to
decompose the variance explained by the general and specific factors and
also to compute the reliability of the factor scores (Gustafsson &
Åberg-Bengtsson, 2010).

Our  findings  indicated  that  the  general  reading  comprehension 
factor  explained  only  18.70%  of  the  variance  in  the  questions, 
showing  that  there  is  a  lot  of  (unexplained)  item-specific  variance. 
In  a  single  factor  model  with  all  77  items,  the  general  factor 
explained  20.37%  of  the  variance.  Basaraba  et  al.  (2013)  con¬ 
structed  both  single  factor  and  bifactor  models  and  found  that 
around  31%  of  the  variance  was  explained  by  the  general  factor. 
Note  that  in  the  current  study  questions  from  nine  different  texts 
load  on  the  general  factor,  while  in  the  study  of  Basaraba  et  al. 
(2013),  who  conducted  comparable  analyses,  all  20  questions 
originated  from  one  text.  In  additional  analyses,  we  constructed  a 
factor  model  with  questions  originating  from  a  single  text  and  then 
found  comparable  percentages  of  variance  explained  by  the  gen¬ 
eral  factor  (e.g.,  28.71%  for  the  17  questions  from  the  text  “The 
Mystery  of  the  Giant  Tooth”  and  34.47%  for  the  16  questions  from 
the  text  “Enemy  Pie”).  Thus,  the  general  reading  comprehension 
factor  explains  more  variance  if  it  is  not  distinguished  from  text 
specific  variance.  Consequently,  as  the  item  pool  becomes  larger 
and  the  number  of  texts  increases,  less  variance  will  be  explained 
by  the  general  factor. 
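
As a rough illustration of how such percentages arise, the share of item variance explained by a factor can be computed as the mean of the squared standardized loadings. The sketch below assumes standardized items and mutually orthogonal factors; the function name and the loadings are hypothetical and are not taken from the study.

```python
def percent_variance_explained(loadings):
    """Percentage of total standardized item variance explained by one factor.

    Assumes each item has unit variance and that the factor is orthogonal
    to all other factors, so a squared loading is the proportion of that
    item's variance explained by the factor.
    """
    if not loadings:
        raise ValueError("need at least one loading")
    return 100.0 * sum(value * value for value in loadings) / len(loadings)


# Hypothetical general-factor loadings for five items (not study data).
general_loadings = [0.5, 0.4, 0.3, 0.6, 0.2]
print(f"{percent_variance_explained(general_loadings):.2f}% explained")
```

Under these assumptions, adding items with weak general-factor loadings lowers the mean squared loading, which is consistent with the argument that a larger and more varied item pool yields a smaller percentage.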

The  inability  to  find  reliable  specific  factors  of  reading  compre¬ 
hension  might  be  caused  by  the  fact  that  as  compared  to  previous 
studies, we used a relatively homogeneous set of texts and questions
(e.g., Keenan et al., 2008). For example, the Peabody Individual
Achievement Test and the Woodcock-Johnson Passage
Comprehension-3 (e.g., Keenan & Meenan, 2014), which are widely
used reading comprehension tests in the United States, differ
strongly from the Dutch reading comprehension tests included in the
present study. The Peabody Individual Achievement Test requires
children to read a single sentence and, after the sentence is
removed, to choose the picture that best expresses its meaning.
In the Woodcock-Johnson Passage Comprehension-3,
children  are  asked  to  read  passages  consisting  of  one  or  two 
sentences  and  fill  in  the  missing  word.  In  contrast  to  these  reading 
comprehension  tests,  standardized  Dutch  reading  comprehension 
tests  always  consist  of  paragraphs  with  several  sentences  and  never 
require  comprehension  of  a  picture.  The  Dutch  tests  consist  of 
texts  accompanied  by  questions  with  different  formats  and  differ¬ 
ent  levels  of  comprehension.  This  study  showed  that  for  Dutch 
comprehension  tests,  most  of  the  variance  of  the  items  is  explained 
by  a  general  reading  comprehension  factor,  and  specific  item  type 
factors  do  not  explain  much  additional  variance.  Possibly,  differ¬ 
ences  in  the  correlates  of  reading  comprehension  measures  can 
only  be  found  when  comparing  comprehension  measures  that 
differ  more  strongly. 

Another  explanation  for  the  fact  that  reading  comprehension 
items  are  largely  represented  by  a  single  reading  comprehension 
factor  might  be  that  there  are  many  abilities  that  influence  chil¬ 
dren’s  comprehension  of  a  text  (Shanahan,  2014).  It  seems  un¬ 
likely  that  children  would  apply  a  certain  subset  of  those  abilities 
to  answer  specific  reading  comprehension  questions  differing  in 
text  or  question  types.  A  third  explanation  might  have  to  do  with 
the  way  reading  comprehension  tests  are  constructed  (Shanahan, 
2014).  To  end  up  with  a  reliable  test,  items  have  to  be  highly 
correlated  with  each  other.  During  the  development  of  a  reading 
comprehension  test,  items  that  are  not  correlated  highly  with  other 
items  will  be  removed  from  the  test.  This  reduces  the  chance  that 
a  test  measures  different  subskills  of  reading  comprehension. 
However,  it  should  be  noted  that  though  the  construction  of  each 
reading comprehension test might lead to a one-dimensional test, in
the  current  study  the  comprehension  items  came  from  three  dif¬ 
ferent  tests.  Nevertheless,  the  items  reflected  mainly  one  dimen¬ 
sion. 

With  respect  to  some  text  and  question  types,  the  finding  that 
specific  text  and  question  types  could  not  be  distinguished  might 
be  considered  desirable.  For  example,  questions  with  different 
question  formats  should  not  require  different  comprehension  pro¬ 
cesses.  However,  based  on  previous  studies,  we  expected  to  find 
specific  factors  for  literal  and  inferential  questions.  Some  previous 
studies  have  found  that  literal  questions  are  easier  and  require  the 
understanding  of  information  that  is  literally  presented  in  the  text, 
while  inferential  questions  are  more  difficult  and  require  inference 
making (e.g., Basaraba et al., 2013). In the present study, however,
literal  and  inferential  questions  could  not  be  distinguished.  Addi¬ 
tional  analyses  of  the  data  in  this  study  showed  that  children’s 
performance  was  lower  on  the  inferential  questions  than  on  the 
literal  questions.  Thus,  literal  and  inferential  questions  depended 
on similar comprehension abilities, as revealed by confirmatory
factor analyses, but literal questions require a lower level of
reading  comprehension  ability  than  inferential  questions.  A  reason 
for the differences between the current study and previous research
might be that a substantial number of the children in our study were
observed  to  answer  both  literal  and  inferential  questions  by  heart, 
that  is  without  consultation  of  the  text.  These  students  first  read  the 
entire  text  and  then  answered  all  associated  questions  without 
looking  back  to  the  text.  As  a  result,  children  used  their  situation 
model  of  the  text  both  for  answering  literal  and  inferential  ques¬ 
tions. 

Another  finding  was  that  the  correlations  of  word  reading  speed, 
vocabulary,  and  working  memory  with  the  specific  text  and  ques¬ 
tion  type  factor  scores  were  very  low  and  did  not  differ  substan¬ 
tially.  This  was  to  be  expected  given  the  unreliability  of  these 
specific  factor  scores.  Previous  studies  did  find  substantial  rela¬ 
tions  between  cognitive  abilities  and  specific  measures  of  reading 
comprehension  (e.g.,  Eason  et  al.,  2012).  The  difference  in  the 
strength  of  the  relations  is  probably  caused  by  the  fact  that  these 
studies  did  not  take  into  account  a  general  reading  comprehension 
factor.  In  this  study,  the  correlations  of  word  reading  speed, 
working  memory  and  vocabulary  with  the  highly  reliable  general 
reading  comprehension  factor  score  were  substantial.  The  correla¬ 
tion  of  vocabulary  with  the  general  reading  comprehension  factor 
scores  could  even  be  qualified  as  high.  These  findings  are  in  line 
with  correlations  of  these  abilities  with  reading  comprehension 
reported  in  previous  studies  (e.g.,  Oakhill  &  Cain,  2012). 

Limitations 

The  current  study  has  a  number  of  limitations.  The  first  is  the 
relatively  low  interrater  reliability  for  the  scoring  of  the  reading 
comprehension  items  according  to  level  of  comprehension.  Al¬ 
though  a  Cohen’s  kappa  of  .58  is  often  qualified  as  a  moderate 
interrater  reliability,  and  thus  is  sufficient,  it  might  still  be  consid¬ 
ered  undesirably  low  (McHugh,  2012;  Viera  &  Garrett,  2005). 
However,  Cohen’s  kappa  assumes  that  raters  guess  marginal  pro¬ 
portions  of  their  ratings  (McHugh,  2012).  This  did  not  seem  to 
have  happened.  When  raters  tried  to  reach  consensus  on  items  on 
which  they  had  disagreed,  it  became  clear  that  raters  had  never 
guessed outcomes but had always made knowledge-based judgments.
In studies where guessing is less likely, the percentage of
agreement  is  a  better  estimate  of  the  interrater  reliability.  The 
percentage  of  agreement  of  73%  in  the  current  study  can  be 
interpreted  as  strong  agreement  (LeBreton  &  Senter,  2008).  Nev¬ 
ertheless,  27%  of  the  items  were  not  scored  similarly.  To  examine 
whether  this  has  influenced  the  results,  we  carried  out  an  additional 
analysis.  In  this  MTMM  model,  only  the  items  that  were  coded 
similarly  were  used  in  a  model  with  a  general  reading  comprehen¬ 
sion  factor,  text  factors,  and  text  and  question  type  factors.  Also  in 
this  additional  analysis,  the  specific  text  and  question  type  factors 
turned out to be unreliable. Thus, it seems unlikely that the relatively
low  interrater  reliability  has  affected  the  results  of  this  study. 
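
The two agreement indices discussed above can be compared directly. The sketch below computes the percentage of agreement and Cohen's kappa, kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e is the agreement expected from the raters' marginal proportions. The function names and the ratings are hypothetical; they are not the codes assigned in the study.

```python
from collections import Counter


def percent_agreement(rater_a, rater_b):
    """Proportion of items to which both raters assigned the same code."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    return sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(rater_a)
    p_observed = percent_agreement(rater_a, rater_b)
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement from the product of the raters' marginal proportions.
    p_expected = sum(counts_a[code] * counts_b[code] for code in counts_a) / (n * n)
    if p_expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (p_observed - p_expected) / (1.0 - p_expected)


# Hypothetical codes for ten items (L = literal, I = inferential).
rater_1 = ["L", "L", "L", "L", "L", "I", "I", "I", "I", "I"]
rater_2 = ["L", "L", "L", "L", "I", "I", "I", "I", "I", "L"]
print(percent_agreement(rater_1, rater_2), cohens_kappa(rater_1, rater_2))
```

Because kappa subtracts the chance term, it is always at or below the raw percentage of agreement when chance agreement is positive, which is why the two indices can lead to different verbal labels for the same ratings.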

A second limitation is the relatively low reliability of the vocabulary
test. This low reliability is not in line with studies that
used the original form of the Peabody Picture Vocabulary Test
(Dunn  &  Dunn,  1997).  In  the  current  study,  the  vocabulary  test 
was  administered  in  class.  Therefore,  all  children  were  adminis¬ 
tered  the  same  items  and,  unlike  in  the  original  version,  the  items 
were  less  well  adapted  to  the  level  of  the  child.  The  design  of  our 
study  probably  led  to  administering  too  few  items,  resulting  in  a 
decrease  of  the  reliability  of  the  test  as  compared  to  the  original 
version.  A  low  reliability  of  the  vocabulary  measure  might  have 
led us to underestimate its relation with reading comprehension. The
reliability of the working memory test was also rather low. Nevertheless,
the correlations of these cognitive abilities with the general
reading comprehension factor were substantial and generally in
line with those found in previous studies (e.g., Oakhill & Cain,
2012). Thus, the somewhat lower reliabilities of some of the ability
measures probably had only a small effect on the results of
this study.
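
To make the attenuation argument concrete, the sketch below applies Spearman's classical correction for attenuation, in which an observed correlation is divided by the square root of the product of the two reliabilities. This correction is not reported in the study; it is shown only to illustrate why a less reliable measure yields a smaller observed correlation. The function name and all values are hypothetical.

```python
import math


def disattenuated_correlation(r_observed, reliability_x, reliability_y):
    """Estimate the correlation between true scores from an observed correlation.

    Uses Spearman's correction: r_true = r_observed / sqrt(rel_x * rel_y).
    """
    for reliability in (reliability_x, reliability_y):
        if not 0.0 < reliability <= 1.0:
            raise ValueError("reliabilities must lie in (0, 1]")
    return r_observed / math.sqrt(reliability_x * reliability_y)


# Hypothetical values: observed r = .42, reliabilities of .64 and .81.
print(round(disattenuated_correlation(0.42, 0.64, 0.81), 3))
```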

Implications  and  Suggestions  for  Future  Research 

Our  main  conclusion  that  reading  comprehension  is  a  single 
dimension  or  skill  and  does  not  consist  of  different  subskills  has 
important  implications  for  teachers.  However,  this  does  not  imply 
that  all  tests  are  equally  good  as  a  measure  of  this  single  dimen¬ 
sion.  Some  reading  comprehension  tests  will  be  a  better  indicator 
of  the  general  reading  comprehension  factor  than  other  tests,  and, 
consequently,  the  scores  on  these  tests  will  be  less  affected  by 
construct-irrelevant variance (Gustafsson & Åberg-Bengtsson,
2010). 

Another  implication  is  that  errors  on  specific  reading  compre¬ 
hension  questions  are  not  diagnostic  for  problems  with  the  acqui¬ 
sition  of  specific  subskills.  As  a  result,  it  might  not  be  necessary  to 
adapt  instruction  to  different  subskills  (Shanahan,  2014).  Instead, 
teachers  could  focus  more  generally  on  how  students  can  be 
instructed  to  comprehend  texts.  This  implication  should,  however, 
be  considered  with  some  caution.  Future  studies  should  reveal 
whether  the  results  of  this  study  can  be  generalized.  This  study 
contained  a  relatively  homogeneous  set  of  reading  comprehension 
items.  Future  studies  should  focus  on  reading  comprehension  tests 
that differ to a larger extent and on reading comprehension tests in
which  items  are  deliberately  constructed  to  measure  a  specific 
subskill of reading comprehension. In addition, although at fourth
grade the development of reading comprehension seems well
underway, further specialization of types of reading comprehension
might evolve as children grow older. If that is
the case, reading comprehension might become a multidimensional
construct when children grow older.

Conclusions 

Examining  the  dimensionality  of  a  large  pool  of  reading  com¬ 
prehension  items  in  a  sample  of  almost  1,000  fourth  graders 
strongly  suggests  that  reading  comprehension  is  a  one-dimensional 
construct.  Specific  measures  of  comprehension  varying  according 
to  text  and  question  type  hardly  reflected  systematic  variation.  As 
a  result,  the  cognitive  abilities  involved  in  reading  comprehension 
did  not  depend  on  text  and  question  type. 

The sequence in which problems of different concepts are studied during instruction impacts concept
learning.  For  example,  several  problems  of  a  given  concept  can  be  studied  together  (blocking)  or  several 
problems  of  different  concepts  can  be  studied  together  (interleaving).  In  the  current  study,  we  demon¬ 
strate  that  the  2  sequences  impact  concept  induction  differently  as  they  differ  in  the  temporal  spacing  and 
the  temporal  juxtaposition  of  to-be-learned  concept  problems,  and  in  the  cognitive  processes  they  recruit. 
Participants  studied  6  problems  of  3  different  statistical  concepts,  and  then  were  tested  on  their  ability  to 
correctly  classify  new  problems  on  a  final  test.  Interleaving  problems  of  different  to-be-learned  concepts, 
rather  than  blocking  problems  by  concept,  enhanced  classification  performance,  replicating  the  inter¬ 
leaving  effect  (Experiment  1).  Introducing  temporal  spacing  between  successive  problems  decreased 
classification  performance  in  the  interleaved  schedule — consistent  with  the  discriminative-contrast 
hypothesis  that  interleaving  fosters  between-concept  comparisons — and  increased  classification  perfor¬ 
mance  in  the  blocked  schedule — consistent  with  the  study-phase  retrieval  hypothesis  that  temporal 
spacing  causes  forgetting  and  subsequent  retrieval  enhances  memory  (Experiment  2).  Temporally 
juxtaposing  problems  of  concepts  3-at-a-time  rather  than  1-at-a-time  improved  overall  classification 
performance,  particularly  in  a  blocked  schedule — consistent  with  the  commonality-abstraction  hypoth¬ 
esis  that  blocking  fosters  within-concept  comparisons  (Experiment  3).  All  participants  also  completed  a 
working  memory  capacity  (WMC)  task,  findings  of  which  suggest  that  the  efficacy  of  the  above  study 
sequences  may  be  related  to  individual  differences  in  WMC. 

Keywords:  categorization,  induction,  interleaving,  math  learning 


When  students  are  introduced  to  a  concept,  the  instruction  is 
often  paired  with  several  illustrative  problems.  Exposure  to 
these  problems  enables  them  to  abstract  general  principles  that 
define  the  concept,  and  to  subsequently  apply  the  abstracted 
principles  to  new  situations.  For  instance,  the  graduate  student 
who  has  designed  several  experiments  should  be  able  to  recog¬ 
nize  when  a  new  research  problem  requires  the  application  of  an 
independent t test, and be able to differentiate that problem
from  problems  that  require  the  application  of  other  statistical 
tests. 

Importantly,  the  sequence  in  which  the  problems  of  different 
types  of  concepts  are  presented  during  instruction  affects  concept 
learning. For example, in a blocked schedule, several
problems representative of a single concept (e.g., the independent
t test) are studied consecutively to extract the key features
before moving on to the next concept. In contrast, in an interleaved
schedule, problems from different concepts are intermixed
(e.g., independent t test, dependent t test, ANOVA) to
learn the subtle differences that exist among them. Each study
schedule contributes to learning differently. Although the
majority  of  the  research  suggests  that  interleaved  schedules  pro¬ 
duce  greater  learning  gains  than  blocked  schedules  (see  Rohrer, 
2012  for  a  review),  some  research  has  shown  that  blocked  sched¬ 
ules also have the potential to optimize concept learning. Understanding
the conditions under which each of the two schedules is effective
may provide a theoretical basis for tailoring learning and instruction.
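
The two schedules can be sketched as orderings of the same set of problems. The example below is a minimal illustration, assuming a simple round-robin rule for interleaving; the function names, the problem labels, and the ordering rule are hypothetical and do not describe the procedure used in the experiments.

```python
def blocked_schedule(problems_by_concept):
    """All problems of one concept are studied before the next concept begins."""
    return [
        (concept, problem)
        for concept, problems in problems_by_concept.items()
        for problem in problems
    ]


def interleaved_schedule(problems_by_concept):
    """Round-robin over concepts: one problem per concept, then repeat.

    With equal numbers of problems per concept and at least two concepts,
    no two consecutive problems share a concept.
    """
    longest = max(len(problems) for problems in problems_by_concept.values())
    schedule = []
    for index in range(longest):
        for concept, problems in problems_by_concept.items():
            if index < len(problems):
                schedule.append((concept, problems[index]))
    return schedule


# Hypothetical problem labels for three statistical concepts.
problems = {
    "independent t test": ["p1", "p2"],
    "dependent t test": ["p3", "p4"],
    "ANOVA": ["p5", "p6"],
}
print([concept for concept, _ in blocked_schedule(problems)])
print([concept for concept, _ in interleaved_schedule(problems)])
```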

In  the  current  study,  we  examined  the  effects  of  blocked  and 
interleaved  schedules  on  the  learning  of  statistical  concepts.  Spe¬ 
cifically,  we  examined  the  different  factors  and  underlying  pro¬ 
cesses  of  inductive  learning  that  determine  when  (temporal  factors) 
and  how  (cognitive  processes)  one  schedule  may  be  more  or  less 
effective  than  the  other. 

Prior Relevant Research

The effects of interleaved and blocked schedules have been
studied for the learning of motor skills and mathematics procedures,
and for perceptual and text-based categorization. Interleaving is
often  more  effective  than  blocking,  but  blocking  can  also  be 
effective  under  certain  conditions  (see  Carvalho  &  Goldstone, 
2015  and  Rohrer,  2012  for  reviews).  There  are  several  factors  that 
determine  when  one  schedule  is  more  effective  than  the  other.  In 
the  current  study,  we  focused  on  two  temporal  factors — juxtapo¬ 
sition  and  spacing — that  may  interact  with  study  schedules  to 
influence  concept  learning.  In  what  follows,  we  summarize  re¬ 
search  on  the  generality  of  the  interleaving  benefit,  and  then 
review  evidence  that  suggests  that  the  temporal  juxtaposition  of 
concept  problems  and  the  temporal  spacing  between  concept  prob¬ 
lems  promote  unique  cognitive  processes  that  differentially  drive 
interleaving  and  blocking  benefits. 

Generality  of  the  Interleaving  Effect 

Interleaved  practice  has  been  shown  to  promote  better  proce¬ 
dural  learning  than  blocked  practice,  in  particular  for  learning 
mathematics  procedures  (e.g.,  Le  Blanc  &  Simon,  2008;  Mayfield 
&  Chase,  2002;  Rohrer  &  Taylor,  2007;  Taylor  &  Rohrer,  2010). 
For  instance,  in  a  mathematics  learning  study,  participants  learned 
to  find  the  volume  of  four  different  geometric  solids  by  practicing 
problems  of  all  four  types  in  an  interleaved  fashion  or  by  practic¬ 
ing  problems  of  the  same  type  together  in  a  blocked  fashion 
(Rohrer  &  Taylor,  2007).  Interleaving  practice  produced  better 
scores  than  blocking  practice  on  a  problem-solving  test  where 
participants  were  shown  new  problems  and  asked  to  identify  the 
appropriate  test  type,  recall  the  corresponding  formula,  and  then 
execute  the  solution  procedure. 

Similarly, in a perceptual category induction study, Kornell and
Bjork  (2008)  asked  participants  to  learn  artists’  painting  styles 
either  by  presenting  different  paintings  of  the  same  artist  in  a  row 
(i.e.,  blocked)  or  by  mixing  paintings  of  different  artists  (i.e., 
interleaved)  such  that  no  two  paintings  by  the  same  artist  appeared 
consecutively.  Interleaving  study  produced  higher  scores  than 
blocking  study  on  a  final  classification  test  where  participants  were 
shown  previously  unseen  paintings  by  the  studied  artists  and  asked 
to  identify  the  artist  responsible  for  each  new  painting.  This  result 
has  been  replicated  many  times  for  perceptual  category  induction 
not only of artists’ painting styles (Kornell, Castel, Eich, & Bjork,
2010; Kang & Pashler, 2012), but also of butterfly species (Birnbaum,
Kornell, Bjork, & Bjork, 2013) and of bird families (Wahlheim,
Dunlosky, & Jacoby, 2011). The range of research areas
studied  suggests  that  the  interleaving  effect  is  a  general  learning 
phenomenon,  and  in  the  present  article  we  extend  this  effect  to  the 
inductive  learning  of  statistical  concepts. 

In  addition  to  replicating  the  interleaving  effect,  we  examined 
the  different  factors  and  processes  that  determine  when  and  how 
interleaving may be more or less effective than blocking. There are
several  processes  that  support  category  induction,  but  each  of  these 
is  affected  differently  depending  on  study  sequence  manipulations. 
To  learn  categories,  learners  must  be  able  to  understand  the  bound¬ 
aries  of  categories — they  must  learn  the  features  that  distin¬ 
guish  one  category  from  another  (i.e.,  make  between-category 
discriminative-contrast)  and  appreciate  the  commonalities  that  are 
shared across category exemplars (within-category commonality-abstraction).
We propose that an interleaving sequence facilitates
the  former,  whereas  a  blocking  sequence  facilitates  the  latter.  In 
addition  to  the  two  comparative  processes,  there  are  memory 
components  that  support  category  induction.  Learners  must  be  able 
to  hold  in  working  memory  features  from  one  exemplar  to  the  next 
to  make  comparisons,  and  they  also  have  to  be  able  to  memorize 
features  that  correspond  to  each  category.  We  propose  that  de¬ 
creased  temporal  spacing  (i.e.,  presenting  category  exemplars 
three-at-a-time  rather  than  one-at-a-time)  facilitates  the  former, 
whereas increased temporal spacing (via the spacing effect; Cepeda
et al., 2006, 2008, 2009) facilitates the latter. In the following
sections,  we  discuss  the  evidence  for  the  factors  and  processes  in 
relation  to  their  effect  on  blocking  and  interleaving  sequences. 

Role  of  Temporal  Juxtapositions  in  Fostering  Unique 
Concept  Comparisons 

Blocking  and  interleaving  differ  in  the  temporal  juxtaposi¬ 
tions  of  problems  of  to-be-learned  concepts.  In  an  interleaved 
schedule,  problems  of  different  concepts  are  juxtaposed  to¬ 
gether,  which  fosters  between-concept  comparisons,  consistent 
with  the  discriminative-contrast  hypothesis.  In  a  blocked  sched¬ 
ule,  problems  of  the  same  concept  are  juxtaposed  together, 
which  fosters  within-concept  comparisons,  consistent  with  the 
commonality-abstraction  hypothesis. 

Interleaving  Enables  Between-Category  Comparison: 
The Discriminative-Contrast Hypothesis

The  discriminative-contrast  account  proposes  that  when  prob¬ 
lems  of  a  concept  differ  on  a  number  of  dimensions  from  problems 
of  another  concept,  juxtaposing  these  problems  through  interleav¬ 
ing  makes  the  discriminative  features  salient,  which  facilitates 
concept induction (Birnbaum et al., 2013; Kornell & Bjork, 2008).
Conversely,  blocking  problems  of  a  concept  together  renders  it 
harder  to  notice  salient  features  that  differ  between  concepts, 
which  can  make  this  schedule  less  effective  under  these  conditions. 

Evidence  for  the  discriminative-contrast  hypothesis  comes  from 
a  study  by  Taylor  and  Rohrer  (2010)  in  which  participants  learned 
a  two-step  formula  for  calculating  each  of  four  properties  of  a 
prism (i.e., number of corners, edges, faces, and angles) by practicing
problems of all four problem types in an interleaved fashion
or  by  practicing  problems  of  the  same  type  together  in  a  blocked 
fashion.  Interleaving  practice  produced  better  scores  than  blocking 
practice  on  a  problem-solving  test  where  participants  were  shown 
new  problems  and  asked  to  identify  the  appropriate  problem  type, 
recall  the  corresponding  formula,  and  then  execute  the  solution 
procedure.  The  authors  concluded  that  the  observed  advantage  was 
caused  by  the  fact  that  participants  who  practiced  in  a  blocked 
manner  made  more  discrimination  errors  during  problem  solving 
(i.e.,  they  used  the  formula  that  was  appropriate  for  one  of  the 
other  kinds  of  problems).  Thus,  interleaving  provided  participants 
with  an  opportunity  to  practice  how  to  execute  a  solution  proce¬ 
dure,  and  discriminate  when  a  given  formula  was  appropriate, 
whereas  blocking  did  not  provide  the  same  opportunity  given  that 
every  consecutive  problem  in  this  schedule  concerned  the  same 
formula  (e.g.,  Rohrer,  2009;  Taylor  &  Rohrer,  2010). 

Perhaps  the  strongest  evidence  for  the  discriminative-contrast 
hypothesis comes from a study by Birnbaum et al. (2013) on the learning
of butterfly species. They attempted to interrupt the contrast
processes  presumed  to  be  critical  to  the  interleaving  benefit  by 
inserting  unrelated  fillers  between  exemplars.  Indeed,  performance 
decreased  in  the  temporally  spaced  interleaved  schedule  compared 
to  the  typical  uninterrupted  interleaved  schedule.  This  finding 
provides  evidence  for  the  role  of  category  comparisons  in  the 
benefit  of  interleaving  given  that  making  comparisons  is  more 
difficult  in  interleaved  schedules  with  temporal  spacing. 
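
The filler manipulation described above can be sketched as a transformation of an interleaved sequence in which an unrelated item is placed between every pair of successive exemplars. The function name and the filler label are hypothetical; the actual filler materials and timing in the cited study are not specified here.

```python
def insert_fillers(schedule, filler="FILLER"):
    """Place one unrelated filler item between every two successive exemplars."""
    spaced = []
    for index, item in enumerate(schedule):
        if index > 0:
            spaced.append(filler)
        spaced.append(item)
    return spaced


# Hypothetical interleaved sequence of three categories.
print(insert_fillers(["A", "B", "C", "A", "B", "C"]))
```

In the spaced version, exemplars from different categories are no longer adjacent, which is the property the discriminative-contrast account treats as critical.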

Blocking Enables Within-Concept Comparison:
The Commonality-Abstraction Hypothesis

The commonality-abstraction account proposes that when problems
within  a  concept  differ  from  each  other  on  a  number  of  dimen¬ 
sions,  juxtaposing  these  problems  through  blocking  makes  the 
common  features  salient,  which  facilitates  concept  induction.  Con¬ 
versely,  interleaving  problems  of  a  concept  with  problems  of  other 
concepts  renders  it  harder  to  notice  salient  features  that  are  com¬ 
mon  within  a  concept,  which  can  make  this  schedule  less  effective 
under  these  conditions. 

Evidence  for  the  commonality-abstraction  hypothesis  comes 
from a series of recent studies that manipulated concept discriminability.
Their results suggest that blocking is particularly effective
for high-discriminability concepts where all the problems of a
given concept are highly dissimilar from the problems of other
concepts, and the few features that are shared across concepts are
difficult to identify (Kurtz & Hovland, 1956; Whitman & Garner,
1963; Goldstone, 1996; Carpenter & Mueller, 2013). For instance,
Carvalho  and  Goldstone  (2014)  created  artificial  categories  for 
which  the  exemplars  shared  very  few  similarities  within  and  be¬ 
tween  categories.  They  found  that  blocking  the  exemplars  of  a 
category  produced  better  subsequent  classification  performance 
than  did  interleaving  exemplars  of  all  categories  together.  Simi¬ 
larly,  Zulkiply  and  Burt  (2013),  who  manipulated  difficulty  of 
category  discriminations,  also  found  that  a  blocked  schedule  was 
more  effective  for  learning  highly  discriminable  categories  (with 
presumably  low  between-category  similarity). 

The findings discussed so far come mainly from the research on
category and concept learning, which reports limited benefits of
blocking. Research on analogical reasoning, the process of identifying
how aspects of one item correspond with aspects of another
item, can also provide a theoretical perspective on the role of
within-concept  comparisons.  Making  an  analogy  may  be  similar  to 
making  a  within-concept  comparison — determining  what  can  be 
mapped  across  two  items  is  similar  to  determining  why  two 
problems  share  the  same  concept  name.  Studies  in  this  domain 
have  shown  that  comparing  two  or  more  items  promotes  deep 
processing  of  the  content  because  their  similarities  become  high¬ 
lighted,  helping  learners  to  abstract  principles  that  may  be  applied 
in  the  future  (e.g.,  Catrambone  &  Holyoak,  1989;  Gentner,  1983; 
Gick & Holyoak, 1983). In one study, Gentner, Loewenstein, and
Thompson  (2003)  asked  participants  to  either  compare  two  cases 
of  a  negotiation  principle  or  study  each  case  independently.  Those 
who  engaged  in  the  comparison  developed  better  representations 
of  the  principle,  and  were  better  able  to  identify  and  solve  novel 
cases  that  required  the  same  negotiation  principle.  Gentner  et  al. 
argued  that  comparisons  allowed  participants  to  discover  the  un¬ 
derlying  structure  shared  by  both  cases. 


The  temporal  juxtaposition  of  problems  of  to-be-learned  con¬ 
cepts  fosters  unique  comparisons  across  problems  of  different 
concepts  in  the  interleaved  schedules  and  across  problems  of 
the  same  concept  in  the  blocked  schedules.  One  goal  of  the 
current  study  is  to  further  test  the  discriminative-contrast  and 
commonality-abstraction  hypotheses.  In  most  of  the  studies, 
problems are presented one at a time, and the concept comparison
available  to  the  participants  by  way  of  temporal  juxtaposition  is 
not  explicitly  invited.  Studies  in  the  analogical-reasoning  domain 
highlight  the  importance  of  encouraging  learners  to  explicitly 
compare  items  during  instruction.  In  fact,  simply  having  two  cases 
be  presented  side-by-side  rather  than  on  separate  pages  fostered 
the  abstraction  of  the  underlying  structure  (e.g.,  Gentner  et  al., 
2003;  Kurtz,  Miao,  &  Gentner,  2001).  Consistent  with  this  line  of 
research,  we  predict  that  both  interleaving  and  blocking  should 
produce  greater  learning  gains  under  conditions  in  which  partici¬ 
pants  are  allowed  to  simultaneously,  rather  than  sequentially,  view 
problems  of  concepts,  as  simultaneous  sequences  provide  a  more 
explicit  learning  context  to  elicit  the  critical  differences  between 
concepts  and  commonalities  within  a  concept. 
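
The contrast between sequential and simultaneous presentation can be sketched by grouping a study sequence into screens. The example below assumes nine problems from three concepts shown three at a time; the function names and sequences are hypothetical and are not the materials used in the experiments.

```python
def group_into_screens(concept_sequence, per_screen):
    """Split a study sequence into consecutive screens of `per_screen` problems."""
    if per_screen < 1:
        raise ValueError("per_screen must be at least 1")
    return [
        concept_sequence[start:start + per_screen]
        for start in range(0, len(concept_sequence), per_screen)
    ]


def distinct_concepts_per_screen(screens):
    """Number of different concepts visible together on each screen."""
    return [len(set(screen)) for screen in screens]


blocked = ["A"] * 3 + ["B"] * 3 + ["C"] * 3
interleaved = ["A", "B", "C"] * 3

# Blocked screens show one concept each (within-concept comparison);
# interleaved screens show all three concepts (between-concept comparison).
print(distinct_concepts_per_screen(group_into_screens(blocked, 3)))
print(distinct_concepts_per_screen(group_into_screens(interleaved, 3)))
```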

Role  of  Temporal  Spacing  in  Enhancing  Memory  of 
Concept  Features 

Blocking  and  interleaving  differ  in  the  amount  of  temporal 
spacing  that  exists  between  problems  of  to-be-learned  concepts.  In 
an  interleaved  schedule,  problems  of  the  same  concept  are  tempo¬ 
rally  spaced  apart  with  problems  of  other  concepts  interposed 
between  them.  In  a  blocked  schedule,  problems  of  the  same 
concept  are  presented  consecutively  with  no  temporal  lag  or  spac¬ 
ing  between  them.  From  the  memory  literature,  there  are  many 
studies  that  demonstrate  the  benefit  of  spacing  (Cepeda  et  al., 
2006;  Dempster,  1988),  often  referred  to  as  the  “spacing  effect.” 
The  study-phase  retrieval  account  of  spacing  (Thios  & 
D’Agostino,  1976;  Bjork,  1975)  proposes  that  the  interval  (or 
temporal  spacing)  between  repetitions  of  the  same  item  promotes 
forgetting,  leading  to  more  effortful  retrieval,  which  then  in  turn 
strengthens  the  learning  as  compared  to  when  repetitions  are 
massed.  In  other  words — quite  counterintuitively — the  spacing 
effect  relies  on  forgetting  to  promote  learning.  Spacing  is  only 
beneficial,  however,  as  long  as  retrieval  of  the  prior  instance  is 
successful. 
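
The difference in temporal spacing between the two schedules can be quantified as the number of intervening problems between successive problems of the same concept. The sketch below is illustrative only; the function name and the sequences are hypothetical.

```python
def same_concept_lags(concept_sequence):
    """Count the problems that intervene between successive same-concept problems."""
    last_position = {}
    lags = []
    for position, concept in enumerate(concept_sequence):
        if concept in last_position:
            lags.append(position - last_position[concept] - 1)
        last_position[concept] = position
    return lags


blocked = ["A", "A", "B", "B", "C", "C"]
interleaved = ["A", "B", "C", "A", "B", "C"]

# Blocked: same-concept problems are adjacent, so every lag is zero.
print(same_concept_lags(blocked))
# Interleaved: two problems of other concepts separate each repetition.
print(same_concept_lags(interleaved))
```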

However,  the  stimuli  and  the  nature  of  the  to-be-learned  task  in 
spacing  studies  are  different  from  the  studies  that  examine  the 
benefits  of  interleaved  versus  blocked  schedules.  Whereas  spacing 
studies  space  apart  (vs.  mass  together)  repetitions  of  the  same  item, 
interleaving  studies  tend  to  never  repeat  specific  study  exemplars. 
In other words, spacing studies tend to focus on memory, whereas
interleaving studies tend to focus on concept induction. Explanations
regarding the interleaving benefit have often referred to the
discriminative-contrast  hypothesis  and  focused  on  the  juxtaposi¬ 
tion  of  exemplars  from  different  categories.  Interleaving,  however, 
inherently  includes  spacing — to  interleave  problems  of  different 
concepts  is  to  space  apart  problems  of  the  same  concept.  There¬ 
fore,  interleaving  could  be  beneficial  for  category  induction  not 
just  because  it  leads  to  enhanced  discrimination  but  also  as  a  result 
of  spacing  effects.  How  might  spacing  affect  inductive  learning? 
Inductive  learning  relies  on  several  memory  components.  Not  only 
must  learners  be  able  to  successfully  recall  features  from  previ- 
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ously  studied  exemplars  during  study,  but  they  also  must  be  able 
to  remember  what  features  map  onto  which  category.  Interestingly, 
spacing  may  make  it  difficult  to  retrieve  features  of  prior  exem¬ 
plars,  but  it  ultimately  promotes  the  feature-category  association 
(Cepeda  et  al,  2006,  2008). 

Fortunately,  a  handful  of  studies  have  attempted  to  tease  apart 
the  effects  of  discriminative-contrast  and  temporal  spacing  in 
inductive  learning.  Some  of  the  first  studies  have  tried  to  eliminate 
the  benefits  of  spacing,  to  demonstrate  that  the  benefits  of  discrim¬ 
inative  contrast  go  above  and  beyond  that  of  spacing.  For  instance, 
Taylor  and  Rohrer  (2010)  compared  an  interleaved  schedule  with 
a  blocked-spaced  schedule,  in  which  the  temporal  spacing  between 
successive  problems  of  the  same  type  was  equated.  Participants 
solved  four  different  types  of  mathematics  problems.  Each  prob¬ 
lem  was  practiced  for  10  seconds.  In  the  interleaved  condition, 
participants  studied  the  problems  all  intermixed,  so  that  there  was 
an  average  of  30  seconds  (i.e.,  three  problems  from  three  different 
types) in between problems of the same type. In the blocked
condition, there was a 30-s filler (unrelated) task between sets of the
same type of problem to match the temporal spacing between
problems of the same type with that in the interleaved condition.
Performance  in  the  interleaved  condition  was  still  better  than  in  the 
blocked  condition,  suggesting  that  the  role  of  enhanced  discrimi¬ 
nation  in  the  interleaving  effect  cannot  be  explained  by  only 
temporal  spacing.  Similarly,  Kang  and  Pashler  (2012)  compared 
blocked  (aaabbbccc),  interleaved  (abcabcabc),  and  blocked-spaced 
(a--a--a--b--b--b--c--c--c)  schedules,  demonstrating 
that  increasing  the  temporal  spacing  in  the  blocked  condition  did 
not  eliminate  the  interleaving  benefit.  These  two  studies  convinc¬ 
ingly  demonstrate  that  there  are  benefits  of  interleaving  that  go 
above  and  beyond  that  of  spacing,  but  they  do  not  eliminate 
spacing  as  a  factor.  That  is,  these  two  mechanisms  are  not  mutually 
exclusive. 
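
The three schedule types compared above can be made concrete with a short sketch. The code below is only an illustration, not the procedure of any cited study: the function names, the use of "-" as a filler slot, and the two summary measures (mean same-category lag as a proxy for spacing, and the count of adjacent different-category pairs as a proxy for juxtaposition) are assumptions introduced here.

```python
def blocked(categories, n_per_category):
    """Blocked schedule: all items of one category, then the next (aaabbbccc)."""
    return [c for c in categories for _ in range(n_per_category)]


def interleaved(categories, n_per_category):
    """Interleaved schedule: cycle through the categories (abcabcabc)."""
    return [c for _ in range(n_per_category) for c in categories]


def add_fillers(schedule, n_fillers, filler="-"):
    """Insert n_fillers filler slots between successive items."""
    spaced = []
    for i, item in enumerate(schedule):
        if i > 0:
            spaced.extend([filler] * n_fillers)
        spaced.append(item)
    return spaced


def mean_same_category_lag(schedule, filler="-"):
    """Mean number of slots between successive items of the same category."""
    last_seen = {}
    lags = []
    for pos, item in enumerate(schedule):
        if item == filler:
            continue
        if item in last_seen:
            lags.append(pos - last_seen[item] - 1)
        last_seen[item] = pos
    return sum(lags) / len(lags)


def adjacent_contrasts(schedule, filler="-"):
    """Number of directly adjacent item pairs that come from different categories."""
    return sum(
        1
        for a, b in zip(schedule, schedule[1:])
        if a != filler and b != filler and a != b
    )


if __name__ == "__main__":
    schedules = {
        "blocked": blocked("abc", 3),
        "interleaved": interleaved("abc", 3),
        "blocked-spaced": add_fillers(blocked("abc", 3), 2),
    }
    for name, sched in schedules.items():
        print(name, "".join(sched),
              mean_same_category_lag(sched), adjacent_contrasts(sched))
```

Under these assumptions, the blocked-spaced schedule matches the interleaved schedule on same-category lag while offering no direct juxtaposition of different categories, which is the logic that lets such designs separate spacing from discriminative contrast.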

The spacing effect can be observed when examining the role of
temporal spacing across blocked conditions. Birnbaum et al.
(2013; the inductive learning of butterfly species) did in fact report
a blocking benefit (i.e., spacing effect) when exemplars of the
same category in the blocked schedule were spaced apart with 10-s
trivia questions compared with when there was no spacing, although
the typical interleaved condition still produced the highest
performance compared with the temporally spaced blocked condition,
consistent with Taylor and Rohrer’s (2010) findings. Birnbaum
et al. suggested that the temporal delay allowed time for
forgetting, making retrieval of previous exemplars from memory
more effortful, but thereby enhancing learning when such retrievals
were successful. Contrary to the blocking benefit reported by
Birnbaum et al., however, other studies that investigated blocked
schedules with and without temporal spacing did not find a
blocking benefit (Kang & Pashler, 2012; Zulkiply & Burt, 2013;
the inductive learning of artists’ painting styles).

Recent and more direct evidence for the role of temporal spacing
in the interleaving benefit comes from a study in which spacing
is manipulated without disrupting juxtaposition. In Birnbaum et al.
(2013), any given category was juxtaposed against three other
categories,  but  there  was  an  average  of  either  three  (small  spacing) 
or  15  intervening  exemplars  (large  spacing)  between  exemplars 
from  the  same  category.  Any  observed  differences  in  performance 
could  not  be  attributable  to  discriminative-contrast  as  the  degrees 
of  juxtaposition  did  not  vary  across  conditions,  but  rather  to 


temporal  spacing.  Consistent  with  predictions  of  the  study-phase 
retrieval  hypothesis,  they  found  that  large  spacing  produced  better 
classification  performance  compared  to  small  spacing.  In  other 
words,  not  only  is  there  a  role  of  discriminative  contrast  above  and 
beyond  that  of  spacing,  there  is  also  an  effect  of  spacing  above  and 
beyond  that  of  discriminative  contrast. 

The  spacing  inherent  in  an  interleaved  design,  therefore,  does 
contribute  to  the  interleaving  benefit,  as  long  as  spacing  does  not 
disrupt discriminative-contrast processes. Blocked schedules may
also benefit if spacing is added; however, evidence for this assertion
is weak given the limited number of studies that have manipulated
spacing and discrimination independently in a 2 × 2 design.
Thus,  one  goal  of  the  current  study  is  to  examine  the  effect  of 
increasing  temporal  spacing  particularly  in  a  blocked  schedule. 
Increasing  temporal  spacing  in  the  blocked  schedule  should  enable 
a  blocking  benefit,  consistent  with  the  study-phase  retrieval  hy¬ 
pothesis.  The  spacing  between  problems  of  the  same  concept 
would  allow  for  some  forgetting  of  and  subsequent  retrieval  of 
critical  features,  strengthening  the  memory  of  the  concept’s  fea¬ 
tures.  Increasing  temporal  spacing  in  the  interleaved  schedule 
should  harm  learning,  as  the  spacing  would  interrupt  between- 
concept  comparisons  critical  to  the  interleaving  effect,  consistent 
with  the  discriminative-contrast  hypothesis. 

Overview  of  the  Present  Experiments 

The  ability  to  recognize  when  a  given  statistical  concept  applies 
to  a  research  problem  involves  having  knowledge  of  the  relations 
between  the  problem  structure  and  the  concept  features.  In  the 
present  study,  we  examined  the  different  study  sequence  condi¬ 
tions  under  which  this  type  of  concept  knowledge  is  acquired. 

In Experiment 1, we examined whether the interleaving effect
found  with  the  learning  of  mathematics  concepts  and  of  perceptual 
categories  also  extends  to  the  learning  of  statistical  concepts.  In  the 
remaining  experiments,  we  focused  on  two  temporal  factors — 
spacing  and  juxtaposition — that  provide  insight  into  the  cognitive 
mechanisms — comparisons  and  spaced  retrieval — that  make  the 
study  sequences  more  or  less  effective. 

In  Experiment  2,  we  examined  the  interaction  between  study 
schedules  (blocked  vs.  interleaved  schedules)  and  temporal  spac¬ 
ing  (spacing  vs.  no  spacing)  to  test  the  comparison  and  study-phase 
retrieval  hypotheses.  Increasing  temporal  spacing  in  the  inter¬ 
leaved  schedule  should  harm  classification  as  the  spacing  inter¬ 
rupts  between-concept  comparisons  critical  to  the  interleaving 
effect  (as  predicted  by  discriminative-contrast).  Increasing  tempo¬ 
ral  spacing  in  the  blocked  schedule  should  enhance  classification, 
as  the  spacing  allows  for  some  forgetting  of  and  subsequent 
retrieval  of  critical  features,  strengthening  the  memory  of  the 
concept’s  features  (as  predicted  by  study-phase  retrieval). 

In  Experiment  3,  we  examined  the  interaction  between  study 
schedules  (blocked  vs.  interleaved  schedules)  and  temporal  juxta¬ 
position  (sequential  vs.  simultaneous  sequences)  to  test  the  two 
comparison  hypotheses.  Presenting  problems  simultaneously 
rather  than  sequentially  should  enhance  between-concept  compar¬ 
isons  in  the  interleaved  schedule  (as  predicted  by  discriminative- 
contrast)  and  within-concept  comparison  in  the  blocked  schedule 
(as  predicted  by  commonality-abstraction). 

In all experiments, we were concerned with the inductive learning
of critical features that define the to-be-learned statistical
concepts. Participants did not receive any explicit instruction or
lessons  on  the  to-be-studied  concepts;  learning  was  essentially 
discovery-based,  and  the  focus  was  on  successfully  classifying 
never-before-seen  problems  based  on  the  concept  representations 
acquired  from  studying  the  problems  and  inducing  the  critical 
features.  We  chose  these  training  materials  because  the  appropriate 
application  of  each  of  these  concepts  relies  upon  a  conjunction  of 
features,  some  of  which  may  overlap  with  the  features  of  the  other 
concepts.  It  represents,  therefore,  a  clear  case  in  which  learners 
must  first  discover  the  critical  features  by  comparing  study  prob¬ 
lems,  and  in  which  memorizing  the  conjunction  of  features  is  not 
trivial. 

A  secondary,  and  more  exploratory,  goal  of  the  current  study 
was  to  examine  the  relation  between  participants’  cognitive  abil¬ 
ities  and  the  sequence  conditions  to  determine  whether  certain 
study  sequences  mediate  individual  differences  in  cognitive  abil¬ 
ities.  Working  memory  (WM)  is  a  stable  cognitive  trait  and  a 
well-established predictor of academic learning (Alloway & Alloway,
2010). Working Memory Capacity (WMC) is a measure that
reflects  an  individual’s  ability  to  actively  maintain  and  process 
task-relevant  information  (controlled  attention)  and  to  retrieve 
related  information  from  long-term  memory  (LTM)  in  the  face  of 
distraction  (controlled  search  and  retrieval;  Baddeley  &  Hitch, 
1974;  Engle  &  Kane,  2004;  Unsworth  &  Engle,  2007;  Unsworth  & 
Spillers,  2010).  WMC  examination  can  provide  theoretical  insights 
into  the  comparison  and  retrieval  processes  involved  in  study 
sequences.  For  instance,  the  ability  to  actively  maintain  and  pro¬ 
cess  ongoing  task-relevant  information  should  determine  the  suc¬ 
cess  with  which  comparisons  are  made  between  study  problems 
(as  proposed  by  the  discriminative-contrast  hypothesis),  and  the 
ability  to  strategically  search  for  and  successfully  retrieve  problem 
features  from  LTM  should  determine  the  success  with  which 
previously  presented  problem  features  are  retrieved  at  a  later  point 
(as  proposed  by  the  study-phase  retrieval  hypothesis). 

We  included  a  single  WM  task  to  examine  whether  individual 
differences  in  WMC  predict  classification  performance  in  blocked 
and  interleaved  study  schedules.  If  those  with  higher  WMC  are,  in 
fact,  better  able  than  their  counterparts  with  lower  WMC  to  ac¬ 
tively  attend  to  and  compare  relevant  features  of  several  problems, 
and  to  strategically  retrieve  previously  studied  problem  features 
from  LTM,  then  an  interleaved  sequence  may  not  benefit  all 
participants the same way. Practically, because WMC is a stable
characteristic, any differences between higher and lower WMC
participants will have direct implications for tailoring instruction
in classrooms.
However,  given  that  the  examination  between  WMC  and  study 
sequences  in  the  current  study  is  exploratory,  we  limit  the  inter¬ 
pretation  of  the  results,  which  are  speculative  at  best,  and  caution 
readers  to  do  the  same. 

Experiment  1 

The  goal  of  Experiment  1  was  to  examine  whether  an  inter¬ 
leaved  study  schedule  promotes  the  inductive,  classification  learn¬ 
ing  of  statistical  concepts.  Participants  were  presented  with  several 
study  problems  of  three  statistical  concepts  that  were  blocked  by 
concept  or  interleaved  with  problems  of  other  concepts.  With  no 
lessons  or  descriptions  of  these  concepts  provided  beforehand, 
they  had  to  study  the  problems  closely  to  extract  the  critical 
features  diagnostic  of  the  concepts.  On  a  final  classification  test, 


participants  identified  the  statistical  concept  that  could  best  be 
applied  to  never-before-studied  test  problems.  They  all  completed 
the  OSPAN  task  after  the  final  test. 

We  predicted  that  interleaving  would  produce  better  classifica¬ 
tion  performance  than  blocking  for  two  reasons:  First,  juxtaposing 
problems  of  different  concepts  in  the  interleaved  schedule  would 
promote  comparison  between  the  concepts,  as  features  abstracted 
from  the  immediately  prior  problem  would  still  be  active  in  WM 
and  likely  available  for  further  processing  and  integration 
(discriminative-contrast  hypothesis).  Second,  the  temporal  spacing 
between  problems  of  the  same  concept  in  the  interleaved  schedule 
would  promote  forgetting  of  related  features,  leading  to  more 
effortful  retrieval  of  these  features  from  LTM  the  next  time  an¬ 
other  problem  of  the  same  concept  is  presented.  This  forgetting 
and  subsequent  retrieval  across  problem  presentations  would 
strengthen  the  learning  of  that  concept  and  its  features  (study- 
phase  retrieval  hypothesis). 

We  also  examined  the  relation  between  participants’  WMC  and 
classification  performance  in  each  study  schedule,  for  which  we 
had  no  strong  a  priori  predictions,  except  that  participants  with 
higher  WMC  may  benefit  from  interleaving  more  than  participants 
with  lower  WMC,  as  they  are  generally  better  at  allocating  their 
attention  to  relevant  features,  at  resisting  interference  during  en¬ 
coding,  and  at  engaging  in  controlled  LTM  search  for  and  retrieval 
of  relevant  features  that  are  no  longer  active  in  WM  (Unsworth  & 
Engle,  2007). 

Method 

Participants. One hundred twenty-six first-year undergraduate
students (95 females; M age = 18.62 years, SD = 1.88)
enrolled in introductory psychology at McMaster University
participated in the experiment in exchange for course credit.
There were 59 participants in the blocked condition and 67
participants in the interleaved condition.¹ There were no significant
differences in age or statistical background between the two
conditions. Six additional participants completed the experiment but
were excluded from the analysis for indicating previous knowledge
of the statistical concepts being tested in the experiment or for
scoring below chance (i.e., below 33%) on the final test. The
remaining participants’ data were kept for analyses as they
completed all the phases of the experiment and scored above chance
(i.e., above 33%) on the final test.

Materials.  All  materials  were  modified  from  an  undergradu¬ 
ate  statistics  textbook  (Gravetter  &  Wallnau,  2008).  Participants 
learned  three  nonparametric  statistical  concepts:  chi-square  test, 
Kruskal-Wallis test, and Wilcoxon signed-ranks test. Each of the
three concepts was illustrated with six different study problems
(i.e., research design descriptions for which one of the three
statistical tests would be appropriate), none of which included
worked-out solutions (see the Appendix). Table 1 lists the four defining
features  for  each  concept  embedded  in  the  study  problems.  The 


1  Prior  studies  have  found  medium  and  medium-large  effect  sizes.  In 
Experiment 1, using a medium effect size (d = .50), we calculated that the
sample size required to achieve a power of 0.80 at an alpha of 0.05,
two-tailed, was 64 in each condition. From Experiment 1, we found that the
effect  size  was  larger  than  expected  ( d  =  .69).  For  Experiments  2  and  3, 
therefore,  we  aimed  to  have  34  participants  per  condition  to  achieve  the 
same  power  of  0.80. 
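
The sample sizes in this footnote can be approximated with a few lines of code. The sketch below is an assumption-laden illustration rather than the authors' actual calculation (the footnote does not say which tool was used): it applies the normal approximation for a two-tailed, two-group comparison plus a common small-sample correction term (z² / 4), and the function name `n_per_group` is hypothetical.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(d, alpha=0.05, power=0.80):
    """Approximate per-group n for a two-tailed, two-sample t test.

    Normal approximation with a small-sample correction; an exact
    noncentral-t calculation may differ slightly for other inputs.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value, two-tailed
    z_beta = z.inv_cdf(power)           # quantile matching the target power
    n = 2 * (z_alpha + z_beta) ** 2 / d ** 2 + z_alpha ** 2 / 4
    return ceil(n)


if __name__ == "__main__":
    print(n_per_group(0.50))  # medium effect size assumed for Experiment 1
    print(n_per_group(0.69))  # effect size observed in Experiment 1
```

With these inputs the approximation gives 64 and 34 participants per condition, matching the values stated in the footnote.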
Table 1

Structural Features of Three Statistical Concepts Used in All Experiments

(Columns 2 through 5 are the structural features.)

| Statistics test | IV (no. of groups) | Sample | DV | Main question |
|---|---|---|---|---|
| Kruskal-Wallis test | 3+ | Independent | Quantitative | Are there any differences between three or more conditions? |
| Wilcoxon signed-rank test | 1 | Dependent | Quantitative | Is there a significant change in a condition after some treatment? |
| Chi-squared test | 2 | Independent | Categorical | Is there a relationship between two variables? |

content  of  these  problems  was  similar  in  structure,  length,  and 
difficulty.  Interleaving  has  been  shown  to  be  most  advantageous 
for  categories  with  low  discriminability  (Zulkiply  &  Burt,  2013); 
materials  from  the  current  studies  can  be  classified  (although  not 
objectively)  as  having  low  to  moderate  discriminability  as  the 
defining  dimensions  of  each  statistical  concept  have  multiple  pos¬ 
sible  features  with  no  one  feature  that  can  discern  the  concepts. 

In  addition  to  the  18  study  problems,  there  were  a  total  of  nine 
test  problems  (three  for  each  concept)  on  the  classification  test  (see 
Appendix).  Participants  had  to  identify  which  of  the  studied  con¬ 
cepts  best  represented  a  given  test  problem.  These  problems  were 
more  complex  than  the  ones  in  the  study  phase  in  two  ways:  there 
was  no  one  defining  feature  that  cued  participants  to  the  correct 
response,  and  they  had  to  have  learned  at  least  two  of  the  four 
diagnostic  features  of  a  concept  in  order  to  get  the  correct  re¬ 
sponse.  Each  test  problem  included  at  least  one  distractor  feature 
of  a  nontarget  concept. 

All  study  and  test  problems  had  different  storylines.  To  ensure 
that  each  problem  represented  only  one  of  the  three  concepts  and 
included  all  features  defining  the  target  concept,  two  raters  (PhD 
candidates  from  the  Department  of  Statistics,  McMaster  Univer¬ 
sity)  independently  classified  the  problems  based  on  the  concept 
they  illustrate.  We  used  Cronbach’s  alpha  (r  =  .76)  to  calculate  the 
interrater  agreement,  and  made  revisions  to  the  problems  with  low 
item  agreement. 

We used the Automated Operation Span task (OSPAN; Unsworth
et al., 2005) to measure participants’ working memory
capacity  (WMC) — their  ability  to  simultaneously  process  and  store 
information.  During  the  task,  a  mathematical  operation  was  pre¬ 
sented  on  the  screen  (e.g.,  “Does  (4/2)  +  1  =  6?”),  and  partici¬ 
pants  pressed  a  key  to  indicate  whether  the  equation  was  correct  or 
incorrect.  All  responses  and  response  latencies  were  recorded. 
Following  this  answer  selection,  a  letter  was  presented  for  800 
msec  (e.g.,  ‘F’).  After  a  series  of  two  to  five  operation-letter  pairs, 
participants  were  asked  to  recall  the  list  of  two  to  five  letters  in  the 
order  they  were  presented.  Each  participant  was  presented  with 
three  sets  of  each  length.  A  response  was  counted  as  correct  only 
if  the  letter  was  in  the  correct  serial  order,  and  the  letter  itself  was 
correctly recalled. The OSPAN score was the sum of recalled
letters for all sets recalled correctly (completely and in order of
presentation), with possible scores ranging from 0 to 42.² Full
details  of  the  task  structure,  timing,  and  scoring  can  be  found  in 
Unsworth  et  al.  (2005). 
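
The two scoring rules mentioned for the OSPAN task (absolute scoring in the text, partial scoring in the accompanying footnote) can be sketched as follows. This is a minimal illustration under stated assumptions, not the scoring code of the Automated OSPAN: the function names and the example letter sets are hypothetical, and each set is represented as a string of letters.

```python
def absolute_score(presented, recalled):
    """Sum of set sizes for sets recalled completely and in the correct order."""
    return sum(len(p) for p, r in zip(presented, recalled) if p == r)


def partial_score(presented, recalled):
    """Number of letters recalled in the correct serial position, across all sets."""
    return sum(
        sum(1 for a, b in zip(p, r) if a == b)
        for p, r in zip(presented, recalled)
    )


def max_score(set_sizes=range(2, 6), sets_per_size=3):
    """Highest possible score: three sets at each length from two to five."""
    return sum(size * sets_per_size for size in set_sizes)


if __name__ == "__main__":
    presented = ["FK", "JQR", "HLNP"]  # hypothetical letter sets
    recalled = ["FK", "JRQ", "HLNX"]   # hypothetical responses
    print(absolute_score(presented, recalled))  # only the first set is perfect
    print(partial_score(presented, recalled))   # credit for each correct position
    print(max_score())
```

In this example the absolute score counts only the fully correct two-letter set, whereas the partial score also credits correctly placed letters in the imperfect sets; the maximum of 42 follows from three sets at each of the four set lengths.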

Design  and  procedure.  In  this  1-hr  experiment,  participants, 
tested  individually,  completed  the  study  phase  and  test  phase, 
separated  by  a  2-min  distractor  task,  and  then  completed  the 
OSPAN  task.  In  the  study  phase,  participants  were  to  learn  three 
basic  concepts,  each  of  which  would  be  illustrated  with  a  series  of 


short  word  problems.  They  were  instructed  to  carefully  study  these 
problems  and  try  to  identify  the  features  diagnostic  of  the  concepts. 
Participants  were  not  explicitly  told  what  dimensions  defined  a 
concept  leaving  them  to  discover  on  their  own  what  features  of  the 
problems  were  most  relevant.  In  the  test  phase,  they  would  be 
presented  with  a  series  of  new  word  problems  and  their  task  would 
be  to  identify  the  concept  that  can  best  be  applied  to  solve  each 
problem. 

We  manipulated  study  schedule  (blocked  vs.  interleaved) 
between-subjects.  In  the  blocked  condition,  study  problems  were 
blocked  by  concept  (e.g.,  AAAAAABBBBBBCCCCCC),  such 
that  a  number  of  problems  for  a  given  concept  would  appear 
consecutively.  In  the  interleaved  condition,  study  problems  of 
different concepts were interleaved (e.g., ABCABCABCABCABCABC),
such that no two problems of a given concept appeared
consecutively.  For  both  conditions,  each  problem  was  presented 
for  20  s,  one  at  a  time,  with  a  1-s  blank  screen  in  between 
presentations.  The  name  of  the  corresponding  statistical  concept 
appeared  directly  above  each  problem.  After  the  study  phase, 
participants  played  a  game  (minesweeper)  for  two  minutes  fol¬ 
lowed  by  a  self-paced  classification  test.  Test  problems,  randomly 
ordered,  were  presented  one  at  a  time  with  three-alternative  forced 
choice  options  of  the  target  concepts.  Participants  selected  the 
concept  that  best  applied  to  a  problem.  No  feedback  was  provided. 

After  completing  the  classification  test,  they  played  a  video 
game  (bejeweled)  for  two  minutes,  followed  by  instructions  on 
completing  the  OSPAN  task.  They  were  told  that  this  task  required 
participants  to  solve  a  series  of  math  problems  while  trying  to 
remember  a  sequence  of  unrelated  letters,  ranging  from  two  to  five 
letters  in  length.  They  also  received  detailed  instructions  on  screen 
and  practiced  the  task  with  feedback  before  it  began.  Once  they 
completed  the  OSPAN  task,  participants  were  debriefed  and  dis¬ 
missed. 

Results  and  Discussion 

Classification  performance.  We  examined  the  effect  of 
schedule  (blocked  vs.  interleaved)  on  classification  performance 
using  a  between-subjects  analysis  of  variance  (ANOVA).  Consis¬ 
tent  with  the  prior  studies,  the  interleaved  condition  ( M  =  .82, 
SD  =  .17)  yielded  significantly  better  classification  performance 


2  We  used  the  absolute  scoring  methodology  (sum  of  all  sets  in  which  all 
letters  were  recalled  in  the  correct  serial  order)  for  all  analyses.  Nonethe¬ 
less,  partial  scoring  methodology  (sum  of  letters  recalled  in  the  correct 
serial  order,  regardless  of  whether  the  entire  set  was  recalled  correctly) 
resulted  in  the  same  significant  pattern  of  performance  as  that  reported  in 
the  text. 
than the blocked condition (M = .70, SD = .18), F(1, 124) =
12.74, MSE = .03, p = .001, ηp² = .09.
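
As a rough consistency check on these statistics, the effect sizes can be recomputed from the reported means, standard deviations, group sizes, and F value. The sketch below assumes that the effect-size statistic of .09 reported with the ANOVA is partial eta squared and that Cohen's d is computed with a pooled standard deviation; the function names are hypothetical, and small discrepancies would be expected because the inputs are rounded.

```python
from math import sqrt


def cohens_d(m1, sd1, n1, m2, sd2, n2):
    """Cohen's d for two independent groups, using the pooled standard deviation."""
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2)
    return (m1 - m2) / sqrt(pooled_var)


def partial_eta_squared(f, df_effect, df_error):
    """Partial eta squared recovered from an F statistic and its degrees of freedom."""
    return f * df_effect / (f * df_effect + df_error)


if __name__ == "__main__":
    # Interleaved: M = .82, SD = .17, n = 67; blocked: M = .70, SD = .18, n = 59
    d = cohens_d(0.82, 0.17, 67, 0.70, 0.18, 59)
    eta = partial_eta_squared(12.74, 1, 124)
    print(round(d, 2), round(eta, 2))
```

Both values round to the figures given in the paper (d = .69 in the power-analysis footnote, and .09 for the ANOVA effect size).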

Individual  differences.  A  multiple  regression  analysis  using 
enter  method  was  employed  to  examine  the  extent  to  which  study 
sequence and WM scores predicted performance on the classification
test. Scores on the WM task were mean-centered and schedule
was dummy coded (blocked coded as −1; interleaved coded as 1),
and their interaction was computed using the two variables.
Variables were entered in two blocks: main effects followed by main
effects and the interaction. The results from the last block of the
analyses yielded a significant overall regression, F(3, 122) =
11.20, p < .001, adj. R² = .197, with significant main effects of
schedule (β = .29, t(122) = 3.67, p < .001) and WM score (β =
.56, t(122) = 4.11, p < .001), and a significant interaction between
schedule and WM score (β = .34, t(122) = 2.54, p = .012).

We  examined  whether  differences  in  classification  performance 
between blocked (M = 22.01, SD = 5.94) and interleaved (M =
22.42, SD = 7.36) conditions varied as a function of participants’
WMC.  Figure  1  shows  the  linear  regression  lines  of  the  two  study 
schedules  as  they  relate  to  classification  performance  and  WMC. 
Classification  performance  on  the  final  test  increases  as  partici¬ 
pants’  WM  scores  increase,  with  a  steeper  slope  in  the  blocked 
condition  than  in  the  interleaved  condition. 

The linear regression analyses demonstrate that participants’
WMC predicted their classification performance in the blocked
condition, t(1, 57) = 4.27, p < .001, R² = .24. This schedule
produced nonoptimal classification performance, particularly for
lower WMC participants. Participants’ WMC did not predict their
classification performance in the interleaved condition, t(1, 65) =
1.29, p = .20, suggesting that differences in cognitive ability may
be mitigated with an interleaved schedule: lower WMC participants
profited more than higher WMC participants from studying
problems of different concepts together.

Figure 1. Linear regression slopes for classification performance and
working memory scores as a function of study schedule, in Experiment 1.


Experiment  2 

The  goal  of  Experiment  2  was  to  examine  the  interaction  be¬ 
tween  study  schedules  (blocked  vs.  interleaved)  and  temporal 
spacing  (spacing  vs.  no-spacing)  on  the  inductive  learning  of 
statistical  concepts.  The  discriminative-contrast  hypothesis  pro¬ 
poses  that  juxtaposition  of  problems  of  different  concepts  fosters 
between-concept comparisons. The study-phase retrieval hypothesis proposes
that  although  distributing  problems  of  the  same  concept  across 
time  fosters  forgetting  of  problem  features,  it  is  the  subsequent 
effortful,  but  successful,  retrieval  of  those  features  that  strengthens 
their  memory  traces.  Each  time  a  problem  is  presented,  both 
comparison  and  retrieval  processes  are  likely  at  play. 

We  predicted  that  adding  temporal  spacing  to  an  interleaved 
sequence  should  decrease  classification  performance.  Specifically, 
when  study  problems  are  temporally  spaced  apart  with  unrelated 
fillers,  the  interleaving  benefit  may  be  eliminated  because  the 
addition  of  the  fillers  would  disrupt  the  discriminative-contrast 
processing  necessary  to  produce  the  benefit  for  understanding 
concept  boundaries.  With  respect  to  spacing,  interleaving  without 
temporal  spacing  inherently  includes  spacing  between  problems  of 
the  same  concept.  Adding  in  additional  spacing,  therefore,  may 
make  it  too  difficult  to  recall  features  of  prior  problems  of  the  same 
concept  (i.e.,  unsuccessful  retrieval  may  occur).  Evidence  to  sup¬ 
port  this  prediction  is  weak,  as  only  two  studies  have  directly 
compared  interleaved  conditions  with  and  without  temporal  spac¬ 
ing.  Birnbaum  et  al.  (2013)  found  that  adding  spacing  impaired 
category  learning,  and  Zulkiply  and  Burt  (2013)  also  reported  a 
similar  finding,  although  their  comparison  failed  to  reach  signifi¬ 
cance  (see  Figure  2a  for  a  visual  comparison  of  the  results  of  these 
two  studies). 

We  also  predicted  that  adding  temporal  spacing  to  a  blocked 
sequence  should  improve  classification  performance.  When  study 
problems  are  temporally  spaced  apart  with  unrelated  fillers,  we 
should  observe  the  spacing  effect.  The  effect  on  induction  for 
increasing  temporal  spacing  in  blocked  conditions  has  not  been 
extensively  examined  in  the  literature,  although  there  is  weak 
evidence  to  support  our  prediction.  Of  the  three  studies  that  have 
compared  blocked  schedules  with  and  without  spacing,  one  re¬ 
ported  a  blocking  benefit,  consistent  with  the  study-phase  retrieval 
hypothesis  (Birnbaum  et  al.,  2013),  but  the  other  two  did  not  find 
a  blocking  benefit — both  conditions  yielded  similar,  and  nonopti¬ 
mal  learning  gains  (Kang  &  Pashler,  2012;  Zulkiply  &  Burt,  2013; 
see  Figure  2a  for  a  visual  comparison  of  the  results  across  the 
studies).  One  possible  explanation  for  the  discrepancy  in  the  lim¬ 
ited  findings  may  be  related  to  the  nature  of  categories  used.  There 
is  some  evidence  to  suggest  that  blocked  schedules  encourage 
explicit  hypothesis  testing,  which  is  most  favorable  to  rule-based 
or  feature-based  categories  and  concepts  (Noh,  Yan,  Bjork  & 
Maddox, 2016). Although the extent to which artists’
painting styles (Kang & Pashler, 2012; Zulkiply & Burt, 2013) and
butterfly species (Birnbaum et al., 2013) are well defined is unclear, the
statistical  concepts  used  in  the  current  study  are  most  certainly 
rule-based,  with  a  relatively  heavy  memory  component,  suggesting 
that  the  temporal  spacing  manipulation  should  be  sensitive  in 
detecting  any  learning  benefits  in  the  blocked  sequence. 

Finally, we also explored the relation between participants’
WMC and classification performance in each study schedule. We
speculated that participants with lower WMC would be most affected by
the fillers added to the interleaved study sequence because they are more
susceptible to disruptions than higher WMC participants (Kane &
Engle, 2000; Rosen & Engle, 1998), which may affect the success
with which they attend to and compare across concepts’ features,
and retrieve relevant features from LTM in the face of distractions
(i.e., unrelated fillers).

Figure 2. (a) The proportion of new problems (or exemplars) correctly classified on the final test as a function
of study schedule, with the exception of Taylor and Rohrer (2010), for which the proportion of new problems
correctly solved on the final test is shown. (b) The proportion of new problems correctly classified on the final test as a
function of study schedule, in Experiment 2. Error bars represent standard error of the mean.

Method 

Participants. One hundred thirty-seven first-year undergraduate
students (94 females; M age = 18.31 years, SD = .99)
enrolled in introductory psychology at McMaster University
participated in the experiment in exchange for course credit. There
were 35 participants in the blocked-sequential condition and
34 participants in each of the interleaved-sequential, blocked-spaced,
and interleaved-spaced conditions. There were no significant
differences in age or statistical background across the four
conditions. In addition to the 137 participants, two participants
completed the experiment but were excluded from the analysis either
for indicating previous knowledge of the statistical concepts being
tested in the experiment or for scoring below chance (i.e., below 33%)
on the final test. The remaining participants’ data were kept for
analyses as they completed all the phases of the experiment and scored
above chance on the final test.

Materials, design, and procedure. Participants were randomly
assigned to one of four study conditions: blocked-sequential,
interleaved-sequential, blocked-spaced, and interleaved-spaced. The
blocked-sequential and interleaved-sequential conditions
were identical to the blocked and interleaved conditions, respectively,
in Experiment 1; in the former, study problems were blocked by
concept, and in the latter, study problems cycled through the three
concepts with no two problems of the same concept appearing
consecutively. The blocked-spaced and interleaved-spaced conditions
were identical to their sequential counterparts, with
the exception that an unrelated 30-s cartoon comic was inserted
between successive problem presentations. For all conditions, each
study problem was accompanied by the concept’s name directly
above it.
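
The four study sequences described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' materials: the concept labels A, B, and C are hypothetical, the count of six problems per concept is carried over from the description of Experiment 3, and the fixed within-concept order is a simplification (the text does not say how problem order was determined).

```python
from typing import List

CONCEPTS = ["A", "B", "C"]   # hypothetical labels for the three concepts
N_PER_CONCEPT = 6            # assumption: six study problems per concept
FILLER = "FILLER_30s"        # stands in for the unrelated 30-s cartoon


def blocked() -> List[str]:
    """All problems of one concept, then all problems of the next."""
    return [f"{c}{i + 1}" for c in CONCEPTS for i in range(N_PER_CONCEPT)]


def interleaved() -> List[str]:
    """Cycle through the concepts so no concept appears twice in a row."""
    return [f"{c}{i + 1}" for i in range(N_PER_CONCEPT) for c in CONCEPTS]


def add_spacing(sequence: List[str]) -> List[str]:
    """Insert a filler between successive problems (not before or after)."""
    spaced: List[str] = []
    for position, problem in enumerate(sequence):
        spaced.append(problem)
        if position < len(sequence) - 1:
            spaced.append(FILLER)
    return spaced


conditions = {
    "blocked-sequential": blocked(),
    "interleaved-sequential": interleaved(),
    "blocked-spaced": add_spacing(blocked()),
    "interleaved-spaced": add_spacing(interleaved()),
}
```

In the two spaced sequences every pair of successive problems is separated by a filler, so the spaced conditions differ from their sequential counterparts only in temporal spacing, not in problem order.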

Results  and  Discussion 

Classification performance. We conducted a 2 (schedule:
blocked vs. interleaved) × 2 (spacing: sequential vs. spaced)
between-subjects ANOVA to examine the effects of schedule and
spacing on classification performance. As illustrated in Figure 2b,
performance was significantly better in the interleaved conditions
(M = .76, SD = .19) than in the blocked conditions (M = .69,
SD = .20), as indicated by a main effect of schedule, F(1, 133) =
4.64, MSE = .03, p = .033, ηp² = .04. There was no main effect
of spacing (F < 1). Of particular interest, however, was the
significant interaction between schedule and spacing, F(1, 133) =
8.80, p = .004, ηp² = .06.
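
For readers who want to relate the reported F ratios to the effect sizes, partial eta squared can be recovered from an F ratio and its degrees of freedom with the standard identity ηp² = (F × df_effect) / (F × df_effect + df_error). The short sketch below applies it to the interaction term; the function name is illustrative, and values obtained this way can differ slightly from published ones because the published F is itself rounded.

```python
def partial_eta_squared(f_value: float, df_effect: int, df_error: int) -> float:
    """Partial eta squared from an F ratio and its degrees of freedom."""
    numerator = f_value * df_effect
    return numerator / (numerator + df_error)


# Schedule x spacing interaction reported above: F(1, 133) = 8.80
interaction_effect = partial_eta_squared(8.80, 1, 133)
print(round(interaction_effect, 2))  # 0.06
```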

Post hoc comparisons revealed that when study problems were
presented sequentially (without spacing), interleaved study (M =
.81, SD = .18) produced better performance than blocked study
(M = .65, SD = .19), F(1, 133) = 13.20, p < .001, ηp² = .09.
However, when spacing was inserted, there was no difference in
performance between the blocked (M = .75, SD = .19) and
interleaved schedules (M = .72, SD = .17; F < 1), consistent with
the hypothesis that inserting fillers between exemplars prevents
discrimination processing. In fact, with the spacing between study
problems, participants in the blocked schedule numerically, but not
statistically significantly, outperformed those in the interleaved
schedule.

In line with our prediction, we also observed a marginally
significant decrease in classification performance when spacing
was added to the interleaved schedule, F(1, 133) = 3.52, p = .063,
ηp² = .03. Birnbaum et al. (2013) reported a similar but significant
decrement between the interleaved conditions when spacing was
added. Together, these results suggest a role for
between-concept comparisons in explaining the interleaving effect,
as comparisons would be more difficult or more easily disrupted in
interleaved schedules with temporal spacing.

As predicted, in the blocked schedules, inserting spacing between
study problems produced better performance than presenting
problems consecutively (no spacing), F(1, 133) = 5.39, p =
.022, ηp² = .04. This result is in line with the spacing effect
literature (i.e., study-phase retrieval hypothesis; Cepeda et al.,
2006, 2008), and consistent with the blocking benefit reported by
Birnbaum et al. (2013). In the temporally spaced blocked schedule,
the temporal spacing between problems of the same concept likely
promoted forgetting, leading to more effortful retrieval, which
strengthened the learning of that concept. Consistent with findings
by Birnbaum et al. (2013) and Taylor and Rohrer (2010), we found
a numerical benefit of interleaving (without spacing) over the
temporally spaced blocked condition, although it was not statistically
significant.

Individual differences. A multiple regression analysis using the
enter method was employed to examine the extent to which study
sequence, temporal spacing, and WM scores predicted performance
on the classification test. Scores on the WM task were mean-centered,
schedule and spacing variables were dummy coded (Schedule:
blocked coded as -1, interleaved coded as 1; Spacing: sequential
coded as -1, spaced coded as 1), and all interaction terms were
computed using these variables. Variables were entered in five
blocks: each block included an additional interaction term starting
with main effects. The results from the last block of the analyses
yielded a significant overall regression, F(7, 129) = 6.56, p <
.001, adj. R² = .222.
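
The structure of the final regression block can be illustrated with a small sketch that builds the seven predictors from the coding described above and confirms the degrees of freedom of the overall test. The WM scores below are synthetic placeholders drawn at random (they are not the study's data), the group sizes are those reported in the Participants section, and no model is fitted; the sketch only shows how the coded variables, the centered WM score, and their products combine.

```python
import random
from statistics import mean

random.seed(0)  # reproducible placeholder data

# (schedule code, spacing code) -> group size, using the -1/1 coding in the text
GROUP_SIZES = {(-1, -1): 35, (1, -1): 34, (-1, 1): 34, (1, 1): 34}

# Synthetic WM scores: illustrative values only, NOT the study's data
participants = [
    (schedule, spacing, random.gauss(22.0, 7.0))
    for (schedule, spacing), size in GROUP_SIZES.items()
    for _ in range(size)
]
wm_mean = mean(wm for _, _, wm in participants)


def predictors(schedule: int, spacing: int, wm: float) -> list:
    """Three main effects, three two-way products, one three-way product."""
    wm_c = wm - wm_mean  # mean-centered WM score
    return [
        schedule, spacing, wm_c,
        schedule * spacing, schedule * wm_c, spacing * wm_c,
        schedule * spacing * wm_c,
    ]


# Design matrix with a leading intercept column
design = [[1.0] + predictors(*row) for row in participants]
n_obs = len(design)
n_predictors = len(design[0]) - 1
df_model, df_error = n_predictors, n_obs - n_predictors - 1
print(df_model, df_error)  # 7 129
```

The 137 participants and seven predictors give the 7 and 129 degrees of freedom reported for the overall regression.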

The analyses also revealed significant main effects of schedule,
β = .22, t(129) = 2.86, p = .005, and WM score, β = .43,
t(129) = 5.31, p < .001, a significant two-way interaction between
schedule × spacing, β = .21, t(129) = 2.69, p = .008, and a
marginally significant three-way interaction between schedule ×
spacing × WM score, β = .14, t(129) = 1.80, p = .073.

We regressed classification performance on participants’ WM
scores for each schedule condition: blocked-sequential
(M = 22.14, SD = 5.80), interleaved-sequential (M = 22.89,
SD = 8.85), blocked-spaced (M = 22.15, SD = 6.55), and
interleaved-spaced (M = 19.26, SD = 6.94), as illustrated in
Figure 3, to examine whether WMC predicted performance in a
given sequence condition. As observed in Experiment 1,
participants’ WMC predicted their classification performance in the
blocked-sequential condition, t(1, 33) = 4.21, p < .001, R² = .35,
but not in the interleaved-sequential condition, t(1, 32) = 1.89, p =
.067, which suggests that interleaving mitigates differences in
cognitive ability.

Participants’ WMC did not predict their classification
performance in the blocked-spaced condition, t(1, 32) = 1.67, p = .105,
suggesting that distributing study problems of a concept across
time can decrease performance differences between higher and
lower WMC individuals. Participants’ WMC predicted their
classification performance in the interleaved-spaced condition,
t(1, 32) = 3.01, p = .005, R² = .22, which was arguably the most
disruptive and cognitively demanding condition.

Figure 3. Linear regression slopes for classification performance and
working memory scores as a function of study schedule and temporal
spacing, in Experiment 2.

Experiment 3

The goal of Experiment 3 was to examine the interaction between
study schedules (blocked vs. interleaved schedules) and
temporal juxtaposition (simultaneous vs. sequential sequences) on
the inductive learning of statistical concepts. Study problems in
Experiments 1 and 2 were presented one-at-a-time (i.e., sequentially).
Viewing multiple problems at the same time
may reduce memory constraints and provide a more explicit
learning context in which to contrast the critical differences between
concepts in the interleaved schedule, and to connect the critical
commonalities of a given concept in the blocked schedule (Kang &
Pashler, 2012; Mundy, Honey, & Dwyer, 2007). Thus, we predicted
that classification performance for concepts learned in simultaneous
sequences would be better than for concepts learned in sequential
sequences. This should hold true for blocked schedules (consistent
with the similarity-abstraction hypothesis) and for interleaved
schedules (consistent with the discriminative-contrast hypothesis).

Few studies have examined the effect of
simultaneous versus sequential study sequences on category
induction under different study schedules. When comparing
interleaved schedules, Kang and Pashler (2012; artists’ painting styles)
found no learning benefit or impairment for simultaneous over
sequential sequences, whereas Wahlheim et al. (2011; bird
families) did find a learning benefit for simultaneous sequences.
However, the latter study failed to show the typical interleaving effect
under sequential sequences when learning the relatively difficult
stimuli of bird families, and thus may not be a fair comparison
with the current experiment. Nonetheless, neither study
showed benefits for simultaneous sequences in blocked schedules.

On the other hand, studies on analogical transfer have shown
clear learning gains when multiple items of the same category are
compared simultaneously rather than studied separately (e.g.,
Catrambone & Holyoak, 1989; Gentner, Loewenstein, Thompson, &
Forbus, 2009; Gick & Holyoak, 1983; Star & Rittle-Johnson,
2009). For instance, participants demonstrated greater learning of
negotiation principles when allowed to compare two case studies
for a given negotiation strategy side-by-side rather than when these
case studies were studied separately (Gentner et al., 2003).
Simultaneously presenting problems fosters a more direct comparison,
which in turn helps learners overcome contextual limitations and
allows them to recognize the common deep features (e.g.,
Catrambone & Holyoak, 1989; Markman & Gentner, 2005). Moreover,
problems that differ in their surface features (i.e., cover stories,
storylines, events, names, objects) but share similar structural
features (i.e., principles, equations, procedures), as is the case with
our statistics stimuli, can further enable this comparison process
because learners more quickly realize which features are and which
features are not relevant for categorization (e.g., Quilici & Mayer,
1996, 2002).

Finally, we explored whether participants’ WMC would predict
their classification performance in either of the simultaneous study
sequences. It is possible that a simultaneous sequence would
reduce memory demands, which would be particularly helpful for
lower WMC participants who may be more susceptible to noticing
irrelevant features (e.g., problems’ cover stories; Unsworth &
Engle, 2007). Studying the problems three-at-a-time would allow
them to allocate their attention to searching for the relevant concept
features, whether these are features shared by a concept in the blocked
sequence or features that differ across concepts in the interleaved
sequence.

Method 

Participants. One hundred thirty-five first-year undergraduate
students (97 females; M age = 18.94 years, SD = 2.40)
enrolled in introductory psychology at McMaster University
participated in the experiment in exchange for 1 course credit.
There were 33 participants each in the blocked-sequential and
interleaved-sequential conditions, 34 in the blocked-simultaneous
condition, and 35 in the interleaved-simultaneous condition. There
were no significant differences in age or statistical background
across conditions. All participants’ data were kept for
analyses because they completed all phases of the experiment and
scored above chance (i.e., above 33%) on the final test.

Materials, design, and procedure. The procedure for
Experiment 3 was nearly identical to that of Experiment 1. The only
difference was the addition of two more between-subjects conditions
(a total of four study conditions). Participants were randomly assigned
to one of four study conditions: blocked-sequential,
interleaved-sequential, blocked-simultaneous, and interleaved-simultaneous. The
blocked-sequential and interleaved-sequential conditions
were identical to the blocked and interleaved conditions in
Experiment 1.

The blocked-simultaneous and interleaved-simultaneous
conditions were identical to their sequential counterparts,
with the exception that the study problems were presented
three-at-a-time (i.e., on the same page) for 60 s, with a 3-s blank screen
after each set of three problems. In other words, total study time
was held constant across the four conditions. In the
blocked-simultaneous condition, all six exemplar problems from a given
concept were presented before moving on to the next concept.
These six exemplar problems were presented as two sets of
three simultaneously presented problems. In the
interleaved-simultaneous condition, problems were also presented in sets of
three, but each problem in a given set came from a different
concept. For all conditions, each study problem was accompanied
by the concept name directly above it.
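
The grouping of problems into simultaneously presented sets can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the authors' procedure: the concept labels A, B, and C and the problem identifiers are hypothetical, and which particular problems share a screen is a simplification, because the text specifies only the set size and the concept membership of each set.

```python
from typing import List

CONCEPTS = ["A", "B", "C"]  # hypothetical labels for the three concepts
N_PER_CONCEPT = 6           # six exemplar problems per concept (from the text)
SET_SIZE = 3                # problems shown together on one page


def blocked_simultaneous() -> List[List[str]]:
    """Two sets of three same-concept problems per concept, concept by concept."""
    sets: List[List[str]] = []
    for concept in CONCEPTS:
        problems = [f"{concept}{i + 1}" for i in range(N_PER_CONCEPT)]
        for start in range(0, N_PER_CONCEPT, SET_SIZE):
            sets.append(problems[start:start + SET_SIZE])
    return sets


def interleaved_simultaneous() -> List[List[str]]:
    """Each set holds one problem from each of the three concepts."""
    return [[f"{concept}{i + 1}" for concept in CONCEPTS]
            for i in range(N_PER_CONCEPT)]


blocked_sets = blocked_simultaneous()
interleaved_sets = interleaved_simultaneous()
print(blocked_sets[0])      # ['A1', 'A2', 'A3']
print(interleaved_sets[0])  # ['A1', 'B1', 'C1']
```

Both arrangements yield six sets of three problems, so the two simultaneous conditions show the same 18 problems in the same number of screens and differ only in whether a screen juxtaposes problems from one concept or from three.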

Results  and  Discussion 

Classification performance. We conducted a 2 (schedule:
blocked vs. interleaved) × 2 (juxtaposition: sequential vs.
simultaneous) ANOVA to examine the effects of schedule and
juxtaposition on classification performance. Figure 4 illustrates
performance on the classification test as a function of schedule and
juxtaposition. Overall, the interleaved conditions (M = .81, SD =
.18) yielded better performance than the blocked conditions (M = .73,
SD = .19), as indicated by a main effect of schedule, F(1, 131) =
5.64, MSE = .03, p = .019, ηp² = .04. There was also a marginally
significant main effect of juxtaposition, F(1, 131) = 3.79, MSE = .03, p =
.054, ηp² = .03, with simultaneous sequences (M = .80, SD = .16)
leading to better performance than sequential sequences (M = .74,
SD = .20). In other words, presenting problems three-at-a-time,
relative to presenting problems one-at-a-time, may have increased
the extent to which participants noticed differences across
problems of different concepts and noticed similarities across problems
of the same concept.




SANA,  YAN,  AND  KIM 


[Figure 4: bar graph. Vertical axis: proportion correctly classified.
Horizontal axis: Blocked, Interleaved. Legend: Sequential, Simultaneous.]

Figure 4. The proportion of new problems correctly classified on the
final test as a function of study schedule, in Experiment 3. Error bars
represent standard error of the mean.


The schedule × juxtaposition interaction was not significant
(F < 1, p = .31). However, we carried out the planned comparison
to determine whether a simultaneous sequence improved
classification performance in the blocked condition. When schedules were
blocked, simultaneously presenting study problems yielded better
performance than sequentially presenting study problems, F(1,
131) = 4.31, p = .040, ηp² = .03. Research using text-based stimuli
suggests that learning is better when problems are designed to allow
for useful inferences with regard to their structural and surface
features (e.g., Gick & Holyoak, 1987; Quilici & Mayer, 1996). In
the current experiment, we presented participants with three
problems at once, each with different surface features, which may have
promoted interproblem processing (i.e., focusing on common
features) over intraproblem processing (i.e., focusing on specific
wording or details; e.g., Gentner et al., 2003; Gick & Holyoak,
1983; Quilici & Mayer, 1996, 2002). We also found that when
study problems were presented sequentially, the interleaved schedule
produced better performance than the blocked schedule, F(1, 131) =
5.57, p = .020, ηp² = .04, replicating the interleaving effect.

Individual differences. A multiple regression analysis using the
enter method was employed to examine the extent to which study
sequence, temporal juxtaposition, and WM scores predicted
performance on the classification test. Scores on the WM task were
mean-centered, schedule and juxtaposition variables were dummy
coded (Schedule: blocked coded as -1, interleaved coded as 1;
Juxtaposition: sequential coded as -1, simultaneous coded as 1),
and all interaction terms were computed using these variables.
Variables were entered in five blocks: each block included an
additional interaction term starting with main effects. The results
from the last block of the analyses yielded a significant overall
regression, F(7, 127) = 5.02, p < .001, adj. R² = .173. The
analyses revealed significant main effects of schedule, β = .19,
t(127) = 2.44, p = .016, juxtaposition, β = .17, t(127) = 2.21,
p = .029, and WM score, β = .30, t(127) = 3.61, p < .001, and
significant two-way interactions between schedule × WM score,
β = .23, t(127) = 2.78, p = .006, and juxtaposition × WM
score, β = .24, t(127) = 2.36, p = .020. Follow-up analyses of the
two-way interactions showed that WM score predicted
classification test performance in the blocked conditions, β = .42, t(65) =
3.15, p < .001, R² = .18, but not in the interleaved conditions (t <
1, p = .72, R² = .04), replicating the general findings from
Experiments 1 and 2. WM score also predicted classification
performance in the sequential conditions, β = .39, t(64) = 3.38,
p = .001, R² = .15, but not in the simultaneous conditions, t(67) =
1.41, p = .163, R² = .03, suggesting that juxtaposing problems
three-at-a-time rather than one-at-a-time decreases performance
differences across lower and higher WMC individuals. Figure 5
shows the linear regression lines for each study sequence:
blocked-sequential (M = 20.68, SD = 6.07), interleaved-sequential (M =
21.43, SD = 6.12), blocked-simultaneous (M = 19.99, SD =
8.17), and interleaved-simultaneous (M = 20.49, SD = 7.45), as a
function of participants’ WMC on their classification test
performance, although no further analyses were conducted because the
three-way interaction between schedule × juxtaposition × WM score, β =
.12, t(127) = 1.43, p = .156, was not significant.

General  Discussion 

At a basic level, a statistics course consists of several highly
abstract concepts that students must learn. Although many may
learn the procedural skills necessary to carry out the calculations
correctly, students often struggle with the conceptual knowledge of
when to apply each procedure. This issue is exacerbated by
the fact that most textbooks emphasize practice in using
procedures but not in determining which procedures to use, and when to
use them (Mayer, Sims, & Tajika, 1995). To acquire conceptual
knowledge, one needs to build a cognitive structure that represents
an understanding of statistics in terms of links and relations
between the important concepts. Our studies demonstrate that this
conceptual learning can be accomplished by juxtaposing study
problems of different concepts, or by studying problems of the
same concept distributed across time. Furthermore, such
instructional strategies may be especially beneficial for learners with
lower cognitive abilities.


Figure 5. Linear regression slopes for classification performance and
working memory scores as a function of study schedule and juxtaposition,
in Experiment 3.

Classification Performance

In  the  current  study,  we  specifically  examined  the  extent  to 
which  temporal  factors  (i.e.,  increasing  temporal  spacing  between 
problems,  and  presenting  problems  simultaneously)  and  cognitive 
processes  (discriminative-contrast,  commonality-abstraction,  and 
forgetting-and-retrieval)  determine  when  and  why  one  schedule 
may  be  more  or  less  effective  than  the  other. 

Findings  from  Experiment  1  demonstrate  that  the  interleaving 
benefit  generalizes  to  the  inductive  learning  of  statistical  concepts. 
Participants  were  better  able  to  identify  the  concepts  of  previously 
unseen  test  problems  when  study  problems  were  interleaved  with 
those  of  other  concepts  rather  than  blocked  by  concept. 

Results of Experiment 2 demonstrate an interaction between
study schedules and temporal spacing: increasing spacing by
inserting unrelated fillers between successive study problems
decreased classification performance in the interleaved schedule,
consistent with the discriminative-contrast hypothesis, and
increased classification performance in the blocked schedule,
consistent with the study-phase retrieval hypothesis. Participants were
better at classifying test problems when study problems in the
interleaved schedule were not temporally spaced apart; the
unrelated fillers presumably disrupted the contrast processes critical to
between-concept comparisons. Participants were also better at
classifying test problems when study problems in the blocked
schedule were temporally spaced apart; learning was better
because the temporal delay between problems allowed time to forget
the features, which made retrieval of those features from memory
somewhat difficult, but successful.

Results from Experiment 3 suggest that interleaving study
sequences improved overall classification performance, consistent
with the discriminative-contrast hypothesis. Relative to sequential
presentations, simultaneous presentations improved overall
classification performance. This result seemed to be largely driven by a
difference in performance between the two blocked schedules,
consistent with the commonality-abstraction hypothesis.
Participants were better at classifying test problems when study problems
in the blocked schedule were presented three-at-a-time rather than
one-at-a-time. A simultaneous study sequence may reduce
memory load and thereby provide a more explicit learning context in
which to identify the critical commonalities within a concept, extract
the relevant features shared across problems, and disregard the
irrelevant surface features.

Overall, the interleaving benefit across the three experiments
supports the notion that the interleaving effect is a general learning
phenomenon, extending into the learning of nonparametric
statistics concepts. Critically, the current study is one of few that
suggest potential boundary conditions for the interleaving effect,
testing complementary and competing dynamics of
discriminative-contrast, commonality-abstraction, and temporal spacing in
inductive learning.

The interplay between our various manipulations of study
sequencing (blocking vs. interleaving, no-spacing vs. spacing,
simultaneous vs. sequential) demonstrates the importance of all
three processes in inductive category learning. The role of
discriminative-contrast in inductive learning is revealed in the
finding that interleaved-sequential study led to better (or at
least as good) classification performance than did blocked
study (whether simultaneous, sequential, or spaced; Experiments 1-3),
and that interleaved-spaced study led to worse classification
performance, as a result of disrupted juxtaposition, than did
interleaved-sequential study (Experiment 2). The role of
commonality-abstraction in inductive learning is revealed in the finding that
blocked-simultaneous study led to better classification
performance than blocked-sequential study (Experiment 3); that is,
simultaneous study facilitated the noticing of within-concept
commonalities. Finally, the role of spacing for inductive learning is
highlighted in the finding that the temporally spaced blocked
study led to better classification performance than the
blocked-sequential study (Experiment 2).

To summarize, the three cognitive processes (noticing differences
between concepts, noticing commonalities within concepts, and
remembering critical features) are influenced by manipulations of temporal
juxtaposition (sequential vs. simultaneous) and spacing
(no-spacing vs. spacing) for both the interleaved and the blocked
sequences. But why do the same manipulations have different
effects on learning in the two kinds of sequences? For
example, in the current study, the disruptive juxtaposition effects
of spacing were greater for interleaved schedules than for blocked
schedules; spacing disrupted contrast processes critical for
learning in the interleaved schedule, but produced some forgetting and
subsequent retrieval of concept features in the blocked schedule.
Another example is the benefit of simultaneous sequences, which
seemed to be greater for blocked schedules than for interleaved
schedules; simultaneous presentations provided a more explicit
learning context to elicit the critical commonalities in the blocked
schedule, but produced nonsignificant learning gains in the
interleaved schedule. The current results underscore the importance
of developing a theoretical framework that generates testable
predictions about the relative effects of various manipulations (e.g.,
sequential, spaced, and simultaneous) on learning in the two
schedules.

Another important goal of future research should be to closely
examine the boundary conditions of the interleaving effect, and
potential factors that moderate this effect. The current study
suggests that temporal spacing and juxtaposition, and (potentially)
individuals’ WMC, moderate the efficacy of practice schedules.
Another goal is to replicate study schedule efficacy under
conditions that include direct instruction rather than just inductive
learning. Rohrer and colleagues, who report interleaving benefits
when learning mathematics concepts, provided their participants
with lessons on the concepts before the practice sessions. If the
goal is to generalize the interleaving effect to more educationally
realistic settings, it is important to replicate the findings with
statistics concepts, and with other relevant stimuli.

Individual  Differences  in  WMC 

We also examined the relations between participants’ cognitive
abilities and their classification performance as a function of the
different study sequences. Across the three experiments, WMC
predicted classification performance when the study sequences
were blocked and interleaved-spaced. WMC did not predict
classification performance when the study sequences were interleaved
and blocked-spaced.

Interestingly, the observed learning detriments and gains seem
to be driven particularly by participants with lower WMC. This
may be because individuals with higher WMC are better at
controlling their attention to process task-relevant information, and
controlling their search of LTM to retrieve relevant information
(Unsworth & Engle, 2007). Moreover, these individuals may
already use efficient strategies to process information, unlike their
lower WMC counterparts (e.g., Brewer & Unsworth, 2012), and
therefore they may be less affected by the different study sequence
manipulations. There are several theoretical possibilities that may
explain why lower WMC individuals may be more susceptible to
specific sequence manipulations: (a) the extent to which they
successfully make comparisons across different problems may
depend on their ability to maintain attention to task-relevant
information in the face of distraction, and (b) the extent to which
they successfully retrieve encoded features of previous concepts
from LTM may depend on their ability to engage in controlled and
strategic search of LTM (Unsworth & Engle, 2007).

The exploratory examination of relations between WMC and
various study sequences should be interpreted with caution.
Although it is clear that both the controlled attention and the
search/retrieval components of WM are related to the comparison and
retrieval processes that are at the heart of blocked and interleaved
sequences, further studies are warranted to replicate and more
closely investigate the theoretical underpinnings of these relations.
Findings from the current study suggest that interleaved schedules
and temporally spaced blocked schedules can decrease learning
differences across individuals with varying cognitive abilities, and
increase overall learning for all individuals, suggesting that these
sequences are optimal for the inductive learning of statistical
concepts.

Concluding  Comments 

We  examined  the  different  factors  and  processes  that  determine 
when  and  how  one  schedule  may  be  more  or  less  effective  than  the 
other. Interleaved schedules are effective when problems of
different concepts are presented (together or individually) with no
disruptions in between the problems. This seems to be because
interleaving facilitates between-concept comparison that is
susceptible to disruptions. Blocked schedules are effective when there is
some form of temporal spacing introduced in between problems,
as  the  temporal  lag  allows  for  some  forgetting  and  subsequent 
retrieval  of  problem  features,  memory  traces  for  which  are  then 
strengthened. Blocked schedules may also be effective when
problems of the same concept are juxtaposed three-at-a-time, as it
provides  a  contextual  learning  environment  that  allows  learners  to 
make  explicit  comparisons  and  extract  the  relevant  features  shared 
by all problems from the irrelevant features. Findings from the
current study suggest that an interleaving schedule is generally an
effective method of study and instruction. If interleaving is
impractical or inconvenient, a blocking schedule may be structured to
be  effective  by  introducing  temporal  spacing  or  by  presenting 
problems  of  a  concept  simultaneously. 
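
The schedule types contrasted above can be made concrete with a short sketch. The Python below is illustrative only and is not the procedure used in the study: the function names, the problem labels (W1, C1, K1, and so on), and the use of a generic filler-task marker to stand for temporal spacing are all assumptions.

```python
from itertools import chain, zip_longest

def blocked(problems_by_concept):
    """All problems of one concept, then all problems of the next concept."""
    return list(chain.from_iterable(problems_by_concept.values()))

def interleaved(problems_by_concept):
    """Round-robin across concepts so that adjacent problems differ in concept."""
    rounds = zip_longest(*problems_by_concept.values())
    return [p for rnd in rounds for p in rnd if p is not None]

def blocked_spaced(problems_by_concept, filler="<filler task>"):
    """Blocked order with a hypothetical filler task between successive problems."""
    seq = blocked(problems_by_concept)
    spaced = []
    for i, problem in enumerate(seq):
        spaced.append(problem)
        if i < len(seq) - 1:  # no filler after the final problem
            spaced.append(filler)
    return spaced

# Hypothetical problem labels for three statistical concepts.
problems = {
    "wilcoxon": ["W1", "W2", "W3"],
    "chi_squared": ["C1", "C2", "C3"],
    "kruskal_wallis": ["K1", "K2", "K3"],
}

print(blocked(problems))
print(interleaved(problems))
print(blocked_spaced(problems))
```

In this sketch, interleaving places problems of different concepts next to each other with nothing in between, whereas the spaced variant of blocking separates problems of the same concept by an unrelated activity.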

Properly  constructed  study  sequences  can  promote  concept 
learning  in  the  domain  of  statistics.  Our  use  of  ecologically  valid 
materials  offers  some  confidence  in  the  generalizability  of  these 
findings  to  other  cognitive  concepts.  There  are  several  straightfor¬ 
ward  practical  ways  in  which  our  findings  can  be  implemented  in 
formal  educational  settings.  Course  assignments  and  end-of- 
chapter  practice  problems  in  textbooks  may  include  problems  not 
just from the current topic, but also from previous units or
chapters. Instructors can also lead explicit discussions on the
commonalities and differences across concepts. This direct instruction may
offer  benefits  not  only  in  terms  of  increasing  students’  understand¬ 
ing,  but  also  in  terms  of  encouraging  students  to  search  out 
comparison opportunities on their own.

Appendix

Sample  Problems 


Sample of Word Problems Used in the Study Phase

Wilcoxon Signed-Ranks Test

We  want  to  know  whether  the  acupuncture  treatment  is  effective 
to  cure  chronic  back  pain.  From  a  random  sample,  participants  are 
asked  to  rate  their  level  of  back  pain  before  the  treatment  begins, 
and  again  after  the  treatment  ends.  Is  there  a  significant  change  in 
reported  pain  after  the  acupuncture  treatment  was  given? 

Chi-Squared  Test 

We  want  to  know  whether  iPhone  users  are  more  likely  than 
Blackberry  users  to  own  Apple  laptops.  From  a  random  sample, 
participants  are  categorized  as  using  an  iPhone  or  a  Blackberry, 
and  as  owning  an  Apple  laptop  or  a  non-Apple  laptop.  Is  using  an 
iPhone  related  to  owning  an  Apple  laptop? 

Kruskal-Wallis Test

We  want  to  know  whether  first-born  children  are  happier  than 
their  younger  siblings.  From  a  random  sample,  participants  are 
divided  into  three  groups  based  on  whether  they  are  the  only  child, 
eldest  child,  or  youngest  child.  All  participants  report  their  level  of 
happiness.  Is  there  a  difference  in  happiness  ratings  among  the 
groups? 

Sample of Word Problems Used in the Test Phase

Wilcoxon Signed-Ranks Test

Does drinking alcohol significantly increase the errors in
enunciation produced during a singing performance? Professional
singers are asked to perform their favorite song before being served
with  several  free  tequila  shots.  They  are  then  asked  to  perform 
another  song  of  their  choice.  Not  surprisingly,  all  singers  produced 
12  or  more  enunciation  errors  once  they  consumed  alcohol. 

Chi-Squared  Test 

Are  musicians  likely  to  be  more  knowledgeable  on  musical 
theory  than  nonmusicians?  Residents  of  Ontario,  Canada  are  di¬ 
vided  up  based  on  whether  they  are  musicians  (i.e.,  play  an 
instrument  for  10+  hrs  a  week)  or  nonmusicians  (i.e.,  do  not  play 
any  instrument),  and  whether  they  pass  or  fail  an  introductory 
musical  theory  test.  It  turns  out  that  musical  expertise  was  not 
related  to  musical  knowledge. 

Kruskal-Wallis Test

Does  the  choice  of  the  paint  color  in  a  classroom  affect  student 
learning?  Students  are  assigned  to  separate  classrooms  that  have 
blue,  green,  or  yellow  colored  paint  on  the  walls  and  ceilings.  They 
study  a  chapter  on  French  revolution  and  then  write  a  multiple- 
choice  test  on  the  content.  The  test  scores  did  not  vary  depending 
on  the  paint  color  of  the  classroom  in  which  students  studied. 
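
The sample problems above differ in the design features that determine which test applies. The following Python sketch encodes one possible decision rule for the three tests named in this appendix. It is an illustration under stated assumptions, not part of the study materials: the function name, the argument names, and the coding of a repeated-measures design as one group with `paired=True` are hypothetical.

```python
def choose_test(outcome, n_groups, paired=False):
    """Pick among the three tests named in the appendix.

    outcome  -- "ordinal" for ratings or scores, "categorical" for category counts
    n_groups -- number of participant groups being compared
    paired   -- True when the same participants are measured twice
    """
    if outcome == "categorical":
        # Two categorical variables cross-classified: test of association.
        return "Chi-Squared Test"
    if outcome == "ordinal" and paired and n_groups == 1:
        # One group rated before and after: paired comparison.
        return "Wilcoxon Signed-Ranks Test"
    if outcome == "ordinal" and not paired and n_groups >= 3:
        # Three or more independent groups compared on ratings or scores.
        return "Kruskal-Wallis Test"
    return "outside the three tests covered by this sketch"

# Study-phase problems, coded by hand under the assumptions above.
print(choose_test("ordinal", 1, paired=True))  # back pain before vs. after acupuncture
print(choose_test("categorical", 2))           # phone type by laptop type
print(choose_test("ordinal", 3))               # happiness by birth-order group
```

The surface stories (acupuncture, phones, siblings) play no role in the rule; only the measurement scale and the grouping structure do, which is the kind of feature learners must extract when comparing problems.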
Learners often insufficiently monitor their comprehension, which results in overconfident judgments of
learning  and  underachievement.  In  the  3  present  experiments,  we  investigated  whether  insufficient 
comprehension  monitoring  is  due  in  part  to  the  fact  that  learners  are  not  sufficiently  aware  of  the  benefit 
of  comprehension  monitoring  and  thus  scarcely  engage  in  this  process.  As  an  intervention,  we  informed 
learners  about  the  likely  negative  consequences  of  failing  to  monitor  their  comprehension.  Specifically, 
we  informed  them  about  the  high  frequency  of  and  the  detrimental  consequences  that  result  from  making 
overconfident  judgments  of  learning.  In  Experiment  1  we  found  that  for  university  students,  this 
intervention  increased  their  engagement  in  comprehension  monitoring,  led  to  more  cautious  judgments 
of  learning,  and  fostered  the  acquisition  of  conceptual  knowledge  in  a  subsequent  learning  phase  in  which 
they  received  instructional  explanations  relating  to  a  new  topic.  By  contrast,  this  intervention  was  less 
beneficial  for  13-  to  15-year-old  high  school  students:  Although  the  intervention  increased  their 
comprehension  monitoring  and  led  to  more  cautious  judgments  of  learning,  it  did  not  foster  the 
acquisition  of  conceptual  knowledge  from  the  subsequent  explanations  (Experiment  2).  In  Experiment  3, 
we  varied  whether  13-  to  15-year-old  high  school  students  received  (a)  information  about  the  frequency 
of  and  the  detrimental  consequences  that  result  from  making  overconfident  judgments  of  learning  and  (b) 
information  about  effective  regulation  strategies.  The  results  of  this  experiment  suggest  that  the  limited 
beneficial  effect  found  in  Experiment  2  could  be  attributed  to  a  lack  of  knowledge  regarding  effective 
regulation  strategies  for  this  age  group. 

Keywords:  comprehension  monitoring,  metacognition,  metastrategic  knowledge,  self-regulation, 
judgments  of  learning 


As  learners  often  insufficiently  monitor  their  comprehension, 
they  are  prone  to  making  overconfident  judgments  of  learning 
(JOLs).  Metacognitive  theory  of  self-regulated  learning  posits  that 
overconfident  JOLs  lead  to  underachievement.  Thus,  in  recent 
years  the  challenge  of  devising  ways  to  foster  comprehension 
monitoring  has  been  receiving  increasing  attention  in  the  fields  of 
metacognition  and  self-regulated  learning  (e.g.,  Alexander,  2013; 
De  Bruin  &  Van  Gog,  2012;  Dunlosky  &  Rawson,  2012;  Lipko  et 
al.,  2009;  Lipowski,  Merriman,  &  Dunlosky,  2013;  Thiede,  Grif¬ 
fin,  Wiley,  &  Anderson,  2010). 

A  well-established  source  of  insufficient  comprehension  moni¬ 
toring  is  that  learners  have  low  monitoring  skills.  This  low  level  of 
monitoring  skills  often  manifests  itself  in  the  form  of  inappropriate 
cue  utilization.  For  instance,  to  monitor  their  comprehension  of 
expository texts, learners frequently utilize cues such as their
interest in the topic or their ability to recall the respective text (e.g.,
Baker & Dunlosky, 2006; Thiede et al., 2010). However, if utilized
to monitor comprehension, these cues likely lead learners to make
overconfident  JOLs.  In  light  of  this,  a  large  proportion  of  the 
interventions  to  foster  comprehension  monitoring  have  focused  on 
increasing  the  degree  to  which  learners  utilize  predictive  cues  (e.g., 
Griffin, Wiley, & Thiede, 2008; Gutierrez & Schraw, 2015; Redford,
Thiede, Wiley, & Griffin, 2012; Thiede & Anderson, 2003;
Thiede,  Anderson,  &  Therriault,  2003;  cf.  Desoete,  Roeyers,  &  De 
Clercq,  2003;  Nietfeld,  Cao,  &  Osborne,  2006). 

However,  although  these  interventions  have  yielded  promising 
results,  low  monitoring  skills  may  not  be  the  only  reason  for 
insufficient  comprehension  monitoring;  a  low  will  to  engage  in 
comprehension monitoring might be a reason as well. By roughly
the age of 12, learners are able to engage in comprehension
monitoring on relatively complex tasks such as learning from
expository texts (e.g., de Bruin & Van Gog, 2012; note that this does not
necessarily entail that learners utilize highly predictive cues).
Although learners possess this skill, recent studies suggest that both
university and 13- to 15-year-old high school students hardly ever
engage in comprehension monitoring while processing new
learning material (e.g., Berthold & Renkl, 2010; Roelle, Lehmkuhl,
Beyer,  &  Berthold,  2015;  Sanchez  &  Garcia-Rodicio,  2013).  As  a 
consequence,  learners  might  often  overlook  comprehension  diffi¬ 
culties  that  could  have  been  detected  had  they  put  more  effort  into 
monitoring.  Based  on  research  on  metastrategic  knowledge  (e.g., 
Schraw, 1998; Zohar & Peled, 2008), one possible explanation for
this insufficient engagement could be that learners lack a type of
conditional  knowledge  regarding  comprehension  monitoring.  That 
is,  they  might  not  know  why  they  should  invest  high  amounts  of 
effort  in  this  process. 

To  test  this  assumption,  we  developed  a  relatively  parsimonious 
intervention.  To  adduce  good  reasons  for  why  a  higher  level  of 
engagement  in  comprehension  monitoring  could  be  useful,  we 
(simply)  informed  learners  about  the  high  frequency  of  overcon¬ 
fident  JOLs  and  the  detrimental  consequences  that  result  from 
making  them  (hereafter  referred  to  as  information  about  the  dan¬ 
gers  of  making  overconfident  JOLs).  Using  university  and  13-  to 
15-year-old  high  school  student  samples,  we  tested  the  effects  of 
this  intervention  on  (a)  comprehension  monitoring,  (b)  JOLs,  (c) 
cognitive  processes,  and  (d)  the  acquisition  of  conceptual  knowl¬ 
edge  in  a  subsequent  learning  phase  in  three  separate  experiments. 

The  Role  of  Comprehension  Monitoring  in 
Self-Regulated  Learning 

In metacognitive theory of self-regulated learning (e.g., Boekaerts,
1997; Koriat & Goldsmith, 1996; Metcalfe, 2009; Nelson &
Narens,  1994)  comprehension  monitoring  is  linked  to  the  forma¬ 
tion  of  JOLs  and,  as  a  consequence,  to  control.  These  links  are 
articulately  described  in  Nelson  and  Narens’  (1994)  metacognitive 
model,  in  which  learning  processes  are  separated  into  two  inter¬ 
related  levels:  the  object-level  and  the  metalevel. 

Object-level  processes  directly  relate  to  objects  in  the  outside 
world.  For  instance,  summarizing  or  elaborating  on  new  learning 
content  (i.e.,  the  object)  would  be  considered  object-level  pro¬ 
cesses.  As  the  term  object-level  processes  is  rarely  used  in  educa¬ 
tional  psychology,  in  the  following  we  use  the  more  common  term 
cognitive  processes.  Metalevel  (or  metacognitive)  processes,  by 
contrast,  do  not  directly  relate  to  objects  in  the  outside  world  but 
to  the  object-level.  These  focus  on  (a)  the  current  state  of  the 
object-level,  (b)  the  goal  state,  and  (c)  available  strategies  to 
change  the  status  of  the  object-level  (see  de  Bruin  &  Van  Gog, 
2012).  For  instance,  judging  one’s  level  of  comprehension  would 
be  considered  a  metalevel  process. 

Cognitive  processes  and  metalevel  processes  are  not  indepen¬ 
dent  of  each  other.  Rather,  they  are  interrelated  via  monitoring  and 
control.  Monitoring  refers  to  the  flow  of  information  from  the 
object-  to  the  metalevel.  When  monitoring,  learners  acquire  infor¬ 
mation  that  can  serve  as  input  for  the  metalevel  process  of  forming 
JOLs  (e.g.,  de  Bruin  &  Van  Gog,  2012;  Nelson,  1996).  JOLs,  in 
turn,  inform  cognitive  processes.  This  direction  of  the  flow  of 
information is termed control (or regulation; see de Bruin & Van
Gog, 2012). For example, based on their JOLs, learners could
decide  to  engage  in  more  cognitive  processes  or  focus  their  cog¬ 
nitive  processes  particularly  on  content  that  they  have  identified  as 
being  not  yet  well-understood.  In  the  following,  we  will  use 
regulation  instead  of  control  as  it  is  the  more  common  term  in 
educational  psychology. 

The  theoretical  notion  that  monitoring  informs  learners’  JOLs, 
which,  in  turn,  inform  decisions  regarding  what  and  how  to  study 
(i.e.,  regulation)  has  been  firmly  established  by  research  in  the 
field  of  metacognition  and  self-regulated  learning  (e.g.,  De  Bruin, 
Thiede,  Camp,  &  Redford,  2011;  Dunlosky  &  Rawson,  2012; 
Koriat,  2012;  Metcalfe  &  Finn,  2008;  Pieschl,  Stahl,  Murray,  & 
Bromme, 2012; Son & Metcalfe, 2005; Thiede & Dunlosky, 1999;
Thiede et al., 2010; van Loon, De Bruin, Van Gog, & Van
Merriënboer, 2013). Even so, research in this field also shows that
regulation  is  frequently  driven  by  inaccurate  JOLs.  Empirical 
studies  have  indicated  that  learners  often  overconfidently  judge 
their  level  of  comprehension,  especially  when  they  have  a  low 
level  of  prior  knowledge  regarding  content  that  is  to  be  learned 
(e.g.,  when  dealing  with  a  new  topic;  see  Dunlosky,  Hartwig, 
Rawson,  &  Lipko,  2011;  Dunlosky  &  Rawson,  2012;  Dunning, 
Johnson,  Ehrlinger,  &  Kruger,  2003;  Karpicke,  2009;  Miesner  & 
Maki, 2007). Several reasons may account for this.

Why  Learners  Often  Overconfidently  Judge  Their 
Level  of  Comprehension 

Overconfident  JOLs  are  often  explained  by  reference  to  the 
cue-utilization approach (e.g., Koriat, 1997; Rawson, Dunlosky, &
Thiede,  2000).  This  approach  states  that  to  monitor  their  compre¬ 
hension,  learners  utilize  cues  that  they  view  as  being  predictive 
with  respect  to  how  well  they  have  comprehended  given  materials 
(e.g.,  expository  texts  or  instructional  explanations).  However, 
findings  on  cue  utilization  indicate  that  learners  often  have  low 
comprehension  monitoring  skills  because  they  frequently  utilize 
cues  that  do  not  entail  a  high  degree  of  predictive  validity  in  terms 
of  comprehension.  For  instance,  Thiede  et  al.  (2010)  found  that  to 
monitor  their  comprehension  of  expository  texts  on  an  unfamiliar 
topic,  learners  often  utilize  memory-based  cues  such  as  their 
ability  to  recall  information  that  was  included  in  the  text  (see  also 
Baker  &  Dunlosky,  2006).  With  respect  to  comprehension,  such 
cues  likely  lead  to  overconfident  JOLs  because  recalling  informa¬ 
tion  does  not  presuppose  that  learners  have  deeply  comprehended 
the  information  beforehand.  Hence,  it  is  conceivable  that  learners 
are  able  to  recall  information  they  have  not  comprehended  very 
well.  By  contrast,  because  deeply  comprehending  information 
presupposes  that  the  respective  information  has  been  sufficiently 
integrated  into  one’s  prior  knowledge  (e.g.,  Chi,  2009;  Wittrock, 
2010),  it  is  highly  unlikely  that  learners  who  have  deeply  compre¬ 
hended  information  are  unable  to  recall  it.  Thus,  it  can  be  assumed 
that  the  amount  of  recallable  information  often  exceeds  the  amount 
of  well-comprehended  information.  If  such  is  the  case,  then  uti¬ 
lizing  memory-based  cues  for  comprehension  monitoring  leads  to 
overconfident  JOLs. 

In  light  of  this,  a  large  proportion  of  the  interventions  that  were 
designed  to  foster  comprehension  monitoring  have  focused  on 
increasing  the  degree  to  which  learners  utilize  predictive  cues  (e.g., 
by  requiring  learners  to  construct  concept  maps  or  self-explain 
provided  information,  see  Griffin  et  al.,  2008;  Redford  et  al.,  2012; 
see  also  Thiede  &  Anderson,  2003;  Thiede  et  al.,  2003,  2010). 
However,  low  monitoring  skills  may  not  be  the  only  reason  for  the 
perennial  problem  of  overconfident  JOLs.  A  second  reason  might 
be  that,  irrespective  of  the  cues  being  utilized,  the  learners’  will  to 
engage in comprehension monitoring is relatively low. For
instance, the results of studies that analyzed the learning processes of
learners who received written instructional explanations designed
to introduce them to new content indicate that both university
students (Sanchez & Garcia-Rodicio, 2013) and 13- to 15-year-old
high  school  students  rarely  engage  in  comprehension  monitoring 
(Roelle,  Lehmkuhl  et  al.,  2015;  see  also  Berthold  &  Renkl,  2010). 
Similarly, research on fostering learning strategies by means of
writing learning protocols has shown that both university and
high school students invest relatively little effort in comprehension
monitoring even when they are prompted to do so (e.g., Nückles,
Hübner, Dümer, & Renkl, 2010; Roelle, Krüger, Jansen, &
Berthold,  2012).  As  a  consequence,  learners  might  often  overlook 
comprehension  difficulties  that  they  could  have  detected  on  the 
basis  of  their  monitoring  skills. 

By  roughly  the  age  of  12,  learners  develop  the  basic  skills 
required  to  monitor  their  comprehension  on  relatively  complex 
tasks such as learning from expository texts (de Bruin & Van Gog,
2012; note that this does not necessarily entail that learners utilize
highly predictive cues). Hence, although learners at this young age
might avoid comprehension monitoring because it is less likely to
be automated and cognitively more demanding than the execution
of other processes during learning from expository texts (e.g.,
inferential processes; see Kintsch, 2004), the aforementioned findings
cannot simply be attributed to nonexistent monitoring skills, not even
for 13- to 15-year-old high school students. However,
research on metastrategic knowledge (e.g., Paris, Lipson, &
Wixson, 1983; Schraw, 1998; Zohar & Peled, 2008) suggests that
acquiring  a  skill  is  a  necessary  but  insufficient  condition  for 
engaging  in  a  process  or  strategy;  the  will  to  do  so  is  also  neces¬ 
sary.  This  strand  of  research  shows  that  even  when  learners  have 
mastered  a  specific  process  (e.g.,  comprehension  monitoring), 
their  allocation  of  resources  to  the  respective  process  substantially 
depends  on  their  knowledge  of  the  potential  benefit(s)  of  the 
respective  invested  resources  (i.e.,  conditional  knowledge; 
Schraw, 1998; see also Hübner, Nückles, & Renkl, 2010). Seen
from this perspective, the low degree of engagement in
comprehension monitoring could be due to the fact that learners
simply  do  not  know  why  they  should  increase  the  amount  of  effort 
they  invest  therein. 

If this is the case, then a relatively parsimonious intervention to
foster comprehension monitoring could be to provide learners with
persuasive reasons why more engagement in comprehension
monitoring could be worth the extra effort. For example, learners could
be  informed  about  the  likely  consequences  of  failing  to  do  so.  That 
is,  learners  might  benefit  from  (simply)  being  informed  about  the 
high  frequency  of  overconfident  JOLs  and  detrimental  conse¬ 
quences  that  result  from  making  them  (i.e.,  the  dangers  of  making 
overconfident  JOLs). 

How  Learners  Might  Benefit  From  Information  About 
the  Dangers  of  Making  Overconfident  JOLs 

It  is  reasonable  to  assume  that  potential  beneficial  effects  of 
informing  learners  about  the  dangers  of  making  overconfident 
JOLs  are  contingent  on  several  conditions.  First,  learners  need  to 
be  aware  that  comprehension  monitoring  serves  the  function  of 
informing  their  JOLs.  Otherwise,  informing  learners  about  the 
dangers  of  making  overconfident  JOLs  would  not  increase  the 
utility  value  of  comprehension  monitoring  and,  thus,  not  foster 
learners’  will  to  engage  in  it. 

A  second  condition  is  that  a  higher  level  of  engagement  in 
monitoring  actually  leads  to  a  higher  number  of  detected  compre¬ 
hension  difficulties.  The  effects  on  JOLs  are  not  necessarily  sub¬ 
ject  to  this  condition.  That  is,  informing  learners  about  the  dangers 
of  making  overconfident  JOLs  could  lead  to  more  cautious  (i.e., 
lower)  JOLs  even  if  it  does  not  lead  to  the  detection  of  any 
(additional) comprehension difficulties because learners might
simply and superficially subtract a certain value from their JOLs.
However, this might be of limited use, at least in terms of reducing
underachievement,  for  the  following  reason:  In  comparison  to  a 
lowering  of  JOLs  that  is  driven  via  an  increase  in  the  number  of 
detected  comprehension  difficulties,  it  would  scarcely  enhance  the 
basis  for  beneficial  regulation.  More  specifically,  if  learners  are 
not  aware  of  their  specific  comprehension  difficulties,  they  have 
no  point  of  reference  regarding  the  content  on  which  they  should 
focus  their  (additional)  cognitive  processes  to  remedy  their  diffi¬ 
culties  and,  thus,  improve  their  comprehension. 

A  third  condition  is  that  learners  are  able  to  use  the  knowledge 
regarding  their  comprehension  difficulties  to  effectively  regulate 
subsequent cognitive processes (e.g., Nelson, Dunlosky, Graf, &
Narens,  1994;  Son  &  Sethi,  2006).  This  ability  seems  to  develop 
later  than  the  basic  skills  required  to  engage  in  comprehension 
monitoring  (e.g.,  De  Bruin  et  al.,  2011).  Hence,  at  least  for  learners 
who  are  not  much  older  than  12  (i.e.,  the  age  at  which  learners 
usually  have  developed  the  basic  skills  required  to  monitor  their 
comprehension  on  relatively  complex  tasks;  see  de  Bruin  &  Van 
Gog,  2012)  it  is  conceivable  that  informing  them  about  the  dangers 
of  making  overconfident  JOLs  would  lead  to  (a)  an  increase  in  the 
number  of  detected  comprehension  difficulties  and  (b)  lower  JOLs 
but  (c)  would  not  necessarily  foster  regulation.  This  could  be  due 
in  part  to  the  fact  that  they  are  unaware  of  the  cognitive  processes 
they  can  use  to  effectively  remedy  their  difficulties. 

For instance, results by Roelle, Müller, Roelle, and Berthold
(2015) showed that when 13- to 15-year-old high school students
received instructional explanations that were adapted to their
comprehension difficulties, they mainly engaged in the cognitive
process of repeating the explanations’ content. Although repeating
content  can  be  undoubtedly  useful  in  remedying  comprehension 
difficulties,  its  effectiveness  is  restricted  to  relatively  minor  ones. 
If  learners  experience  substantial  comprehension  difficulties,  en¬ 
gaging  in  deep-oriented  cognitive  processes  such  as  thinking  con¬ 
tent  through  using  one’s  own  examples  or  explaining  content  in 
one’s  own  words  might  be  more  effective.  In  contrast  to  repeating, 
these  processes  prevent  learners  from  rote  learning  content  they  do 
not  really  understand.  They  also  encourage  learners  to  make  con¬ 
nections  between  new  and  not  yet  well-understood  content  and 
their  prior  knowledge,  which  is  theorized  to  increase  the  level  of 
comprehension  (e.g.,  Chi,  2009;  Wittrock,  2010).  If  learners  are 
not  aware  of  the  utility  value  of  these  cognitive  processes,  the 
benefit  from  potential  increases  in  the  number  of  detected  com¬ 
prehension  difficulties  could  be  quite  limited. 

Research  Questions 

In  light  of  these  theoretical  considerations,  we  examined 
whether  and,  as  the  case  may  be,  how  informing  learners  about  the 
dangers  of  making  overconfident  JOLs  affects  comprehension 
monitoring,  JOLs,  cognitive  processes,  and  the  acquisition  of 
conceptual  knowledge  in  a  subsequent  learning  phase  in  which 
learners  received  written  instructional  explanations  that  present 
basic  conceptual  knowledge  relating  to  a  new  topic.  First,  we  were 
interested  in  whether  informing  learners  about  the  dangers  of 
making  overconfident  JOLs  would  increase  their  engagement  in 
comprehension  monitoring.  Given  that  learners  do  not  instantly 
reach  a  perfect  understanding  of  the  explanations’  content,  this 
potential  increase  in  engagement  should  be  reflected  in  an  increase 


102 


ROELLE,  SCHMIDT,  BUCHAU,  AND  BERTHOLD 


in  the  number  of  detected  comprehension  difficulties  (Research 
Question 1). Second, we wanted to investigate whether this
potential effect would lead to more cautious JOLs. More specifically, we
were  interested  in  whether,  given  the  same  level  of  acquired 
conceptual  knowledge,  learners  who  were  informed  about  the 
dangers  of  making  overconfident  JOLs  would  show  lower  JOLs 
than  uninformed  learners  because  of  the  potentially  higher  number 
of  detected  comprehension  difficulties  (Research  Question  2). 

Third,  we  were  interested  in  whether  this  intervention  would 
foster  beneficial  regulation  decisions.  To  this  end,  we  investigated 
whether  informed  learners  would  acquire  more  conceptual  knowl¬ 
edge  from  the  instructional  explanations  than  uninformed  learners 
(Research  Question  3).  To  gain  insight  into  the  regulation  deci¬ 
sions  that  drive  the  potential  effect  on  the  acquisition  of  conceptual 
knowledge,  we  analyzed  the  learners’  cognitive  processes.  Specif¬ 
ically,  we  were  interested  in  whether  the  informed  learners  would 
engage  in  more  cognitive  processes  or  focus  more  on  not  yet 
well-understood  content  than  the  uninformed  learners  (Research 
Question  4).  The  latter  should  be  reflected  in  an  increase  in  the 
correlation  between  the  number  of  cognitive  processes  and  the 
acquisition  of  conceptual  knowledge  from  the  instructional  expla¬ 
nations. 
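
The correlational prediction attached to Research Question 4 can be illustrated with a small sketch. The Python below computes a Pearson correlation between the number of cognitive processes and acquired conceptual knowledge for two groups. All values are made-up toy numbers chosen only to show the computation; they are not data from the experiments, and the function and variable names are hypothetical.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Toy values only (not study data): number of cognitive processes per learner.
processes = [2, 4, 6, 8]

# Toy knowledge scores: in the first group, more processing goes along with
# more knowledge; in the second group, the two are unrelated.
knowledge_focused = [1, 2, 3, 4]
knowledge_unfocused = [3, 1, 4, 2]

print(pearson_r(processes, knowledge_focused))
print(pearson_r(processes, knowledge_unfocused))
```

A higher correlation in the informed group than in the uninformed group would be the pattern expected if informed learners directed their cognitive processes toward content they had not yet understood.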

Experiment  1 

Method 

Sample  and  design.  The  participants  of  this  experiment  were 
N  =  42  students  from  different  departments  of  a  German  univer¬ 
sity  (29  women,  13  men;  mean  age  =  25.48  years).  The  sample 
size was determined by a power analysis with the following
parameters: α = .05; power = .80; d = 0.80 (large effect);
one-tailed. This analysis suggested including at least 21
participants per condition. The participants received course credit for
their participation. As we intended to exclusively recruit
participants with low prior knowledge regarding the topic of the learning
material (i.e., atomic structure; see below), students of the
department of chemical science and students who took chemistry as
an advanced-level (i.e., A-level) course at high school were excluded
from  the  study. 
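
The order of magnitude of the sample size stated above can be checked with a rough calculation. The sketch below uses the standard normal approximation for comparing two independent groups, n per group ≈ 2(z_{1-α} + z_{power})² / d². This is an assumption made for illustration: the software and the exact procedure used for the reported power analysis are not stated here, and the function name is hypothetical.

```python
from math import ceil
from statistics import NormalDist

def n_per_group_normal_approx(d, alpha, power, one_tailed=True):
    """Approximate participants per group for a two-group comparison.

    Uses the normal approximation n = 2 * (z_alpha + z_power)**2 / d**2,
    which slightly underestimates the n required by a t test.
    """
    z = NormalDist().inv_cdf
    z_alpha = z(1 - alpha) if one_tailed else z(1 - alpha / 2)
    z_power = z(power)
    return ceil(2 * (z_alpha + z_power) ** 2 / d ** 2)

# Parameters reported in the text: alpha = .05, power = .80, d = 0.80, one-tailed.
print(n_per_group_normal_approx(0.80, 0.05, 0.80))  # 20 under the normal approximation
```

The approximation yields about 20 participants per group. A calculation based on the t distribution requires slightly more, which is in line with the minimum of 21 participants per condition reported above.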

We randomly assigned the participants to one experimental and
one control condition (n = 21 in each condition). Before they
received written instructional explanations relating to the topic of
atomic structure, the learners received either information about the
dangers of making overconfident JOLs (information-about-overconfidence
condition) or no such information (control condition).

Materials. 

Information  about  the  dangers  of  making  overconfident  JOLs. 
We  designed  a  multimedia  presentation  to  inform  the  participants 
in  the  information-about-overconfidence  condition  about  the  dan¬ 
gers  of  making  overconfident  JOLs.  The  first  part  was  designed  to 
highlight  the  fact  that  overconfidently  judging  one’s  level  of 
knowledge  is  a  frequently  occurring  phenomenon.  First,  the  learn¬ 
ers  were  told  that  empirical  studies  suggest  that  there  is  a  strong 
correlation  between  one’s  level  of  prior  knowledge  of  a  topic  and 
the  accuracy  of  JOLs:  The  lower  the  level  of  knowledge,  the  lower 
the  judgment  accuracy.  After  that,  they  were  presented  with  a 
series  of  studies  conducted  by  Dunning  et  al.  (2003),  which  found 
that  learners  with  low  levels  of  knowledge  of  a  topic  frequently 
make  overconfident  JOLs.  To  encourage  deep  processing,  the 
learners  were  prompted  to  interpret  and  describe  a  diagram  that 
showed  the  results  of  one  of  the  studies  (see  Figure  1).  Based  on 
Dunning  et  al.,  this  pattern  of  results  was  labeled  as  the  double 
curse of incompetence: Learners who have low levels of
knowledge of a specific topic not only perform poorly on tests (first
curse) but also judge their level of knowledge overconfidently
(second curse). The first part of the presentation ended with the
caveat that all learners dealing with a new topic run the risk of
overconfidently judging their comprehension.

[Figure 1 screenshot content: x-axis label "Actual Performance Quartile"; on-screen prompt "What key information does the chart convey?"]

Figure 1. Screenshot of one page of the multimedia presentation about the dangers of making overconfident
judgments of learning (JOLs) in Experiment 1 (translated from German). See the online article for the color
version of this figure.

The second part of the presentation was designed to emphasize
the  detrimental  consequences  of  making  overconfident  JOLs.  The 
participants  were  shown  a  scene  in  which  a  model  learner  reads  an 
expository  text  on  a  new  topic  (i.e.,  the  diathesis-stress-model)  and 
then  judges  his  level  of  comprehension  as  being  high.  Having 
made  this  decision,  he  then  decides  to  end  the  learning  phase.  The 
participants  were  then  told  that  this  JOL  might  be  overconfident 
because  the  learner  has  little  prior  knowledge  of  the  topic.  As  a 
consequence,  he  might  not  see  the  (potentially  existent)  need  to 
further  process  the  content  after  reading  the  expository  text  and, 
thus,  terminates  the  learning  phase  prematurely. 

The  multimedia  presentation  ended  with  a  brief  summary  of  the 
main  content.  Because  the  presentation  could  have  functioned  as  a 
monitoring prompt (and gone beyond the sole provision of conditional
knowledge about monitoring), no link was made between
the  double  curse  of  incompetence  and  a  lack  of  engagement  in 
comprehension  monitoring.  Thus,  although  the  presentation  was 
designed  to  adduce  good  reasons  why  a  higher  level  of  engage¬ 
ment  in  comprehension  monitoring  could  be  useful,  it  did  not 
include  any  explicit  information  about  the  role  of  comprehension 
monitoring. 

The  multimedia  presentation  lasted  approximately  13  min. 
However,  because  of  the  diagram  description  task  and  the  possi¬ 
bility  to  proceed  at  their  own  pace,  the  learners  spent  an  average  of 
18  min  watching  the  presentation.  Afterward,  the  participants 
answered  four  multiple-choice  questions  that  were  designed  to 
assess  whether  they  had  understood  the  main  content  of  the  pre¬ 
sentation  (e.g.,  “How  are  the  level  of  knowledge  regarding  specific 
content  and  the  accuracy  of  JOLs  of  the  respective  content  re¬ 
lated?”).  In  case  they  failed  to  correctly  respond  to  any  of  these 
questions,  they  received  immediate  feedback  that  explained  the 
correct  answer  on  the  following  screen. 

Expository  text  about  the  diathesis-stress  model.  Watching 
the  multimedia  presentation  might  to  some  extent  reduce  the 
amount  of  mental  energy  of  the  learners  in  the  information-about- 
overconfidence  condition  that  is  available  for  processing  the  sub¬ 
sequent  instructional  explanations.  To  help  balance  the  conditions, 
we  had  the  learners  in  the  control  condition  work  on  a  task  before 
they  received  the  instructional  explanations.  Specifically,  they  had 
to  read  and  understand  an  expository  text  on  the  diathesis-stress- 
model.  On  average,  the  learners  spent  approximately  13  min 
working  on  this  task. 

Computer-based  learning  environment:  Written  instructional 
explanations  related  to  the  topic  atomic  structure.  In  the  learn¬ 
ing  phase,  which  directly  followed  the  multimedia  presentation 
about  the  dangers  of  making  overconfident  JOLs  or  the  expository 
text  about  the  diathesis-stress  model,  respectively,  all  learners 
worked  in  a  computer-based  learning  environment.  The  learning 
environment  consisted  of  three  units  of  written  instructional  ex¬ 
planations  that  were  designed  to  communicate  basic  conceptual 
knowledge  concerning  the  topic  of  atomic  structure.  Conceptual 
knowledge  refers  to  knowledge  about  facts,  concepts,  and  princi¬ 
ples  that  apply  within  a  domain  (de  Jong  &  Ferguson-Hessler, 
1996). Unit 1 related to the structure of the atomic core, Unit 2
related to the structure of the atomic shell, and Unit 3 related to
ionization  energy.  The  instructional  explanations  within  each  unit 
were  closely  related  to  each  other.  In  total,  the  instructional  ex¬ 
planations  covered  10  basic  principles  and  concepts.  The  instruc¬ 
tional  explanations  were  sometimes  accompanied  by  graphics  that 
were  based  on  ones  used  in  school-level  chemistry  textbooks  (see 
Figure  2). 

To  gather  data  on  the  learning  processes  in  which  the  learners 
engaged  while  processing  the  instructional  explanations,  each  in¬ 
structional  explanation  was  accompanied  by  a  prompt  that  told  the 
learners  to  write  down  their  thoughts  on  the  content  of  the  expla¬ 
nation  (i.e.,  “Write  down  your  thoughts  on  the  explanation”).  All 
learners  were  required  to  type  their  responses  to  these  prompts  into 
text  boxes  located  next  to  the  instructional  explanations  (i.e.,  10 
text  boxes). 

Instruments  and  Measures. 

Pretest:  Assessment  of  prior  conceptual  knowledge.  A  pretest 
assessed  the  learners’  prior  conceptual  knowledge  on  the  topic  of 
atomic  structure  using  four  open-ended  questions.  For  instance,  the 
learners  received  the  following  question:  “How  are  the  mass 
number,  the  nuclear  charge,  and  the  number  of  neutrons  related?” 
Two  of  the  questions  related  to  the  first  unit  of  the  computer-based 
learning  environment,  the  other  two  questions  related  to  Unit  2  or 
3,  respectively.  Based  on  a  scoring  protocol,  the  learners’  answers 
were  examined  for  correct  arguments  for  each  of  the  questions. 
Two  independent  raters  who  were  blind  to  the  hypotheses  and 
conditions  scored  the  written  answers  of  15  learners.  Interrater 
reliability  as  determined  by  the  intraclass  coefficient  (ICC)  with 
measures  of  absolute  agreement  was  very  good  for  each  of  the  four 
questions  (all  ICC  >  .84).  Thus,  only  one  rater  scored  the  rest  of 
the  written  answers.  The  raw  scores  were  summed  up  and  trans¬ 
formed  into  percentage  scores  of  the  theoretical  maximum  number 
of  points  that  could  have  been  achieved  in  the  pretest  (i.e.,  24 
points). 

Assessment  of  learning  processes.  To  gain  insight  into  the 
learning  processes  the  learners  engaged  in  while  processing  the 
instructional  explanations  in  the  computer-based  learning  environ¬ 
ment,  we  analyzed  their  text  box  entries.  This  analysis  revealed 
that  the  learners  mainly  engaged  in  comprehension  monitoring  as 
well  as  two  categories  of  cognitive  processes:  content  repetitions 
and  elaborations.  The  learners’  engagement  in  comprehension 
monitoring  was  assessed  via  the  number  of  detected  comprehen¬ 
sion  difficulties.  For  instance,  if  the  learners  wrote  “There’s  one 
thing  I  didn’t  really  understand.  When  magnesium  surrenders  two 
electrons  and  chlorine  takes  only  one  of  them  during  the  ionization 
process,  what  happens  to  the  other  electron  that  wasn’t  taken  by 
the  chlorine  atom?”,  it  was  coded  as  a  monitoring  episode.  If  the 
learners  wrote  down  information  that  was  explicitly  included  in  the 
explanations  without  adding  any  new  information  (e.g.,  “Protons 
are  located  in  the  atomic  nucleus”),  it  was  coded  as  a  repetition.  By 
contrast,  if  the  learners  generated  correct  information  that  not  only 
related  to  but  also  went  beyond  the  information  included  in  the 
explanations  (e.g.,  examples,  inferences,  links  to  prior  knowledge, 
explanations/summaries  of  content  in  one’s  own  words;  see  Chi, 
2009),  it  was  coded  as  an  elaboration  (e.g.,  “Ions  are  the  label 
given  to  atoms  whose  number  of  electrons  deviate  from  the  normal 
balance  of  protons  and  electrons  of  an  atom.”). 

The segment size was determined by the grain size of the
content of the instructional explanations as well as the three types
of learning processes. Thus, the learners’ text box entries were
segmented such that each segment related to only one specific
content item of the respective instructional explanation. However,
as describing a comprehension difficulty or an elaboration often
required more words than repeating a specific content item, the
segments nevertheless substantially differed in length.

[Figure 2, screenshot text:]

Ernest Rutherford’s Model of the Atom

According to Rutherford's atomic model, every atom consists of a
positively charged nucleus and a negatively charged atomic shell. The
nucleus is ca. 10,000 times smaller than the atom itself (see
illustration). Two types of components can be found in the nucleus:
protons and neutrons. Protons are positively charged components,
whereas neutrons have no charge. Only one type of component can be
found in the atomic shell: electrons. Electrons are negatively charged.
The charge of one electron and one proton are the same size, but they
are polar opposites. Under normal conditions every atom is electrically
neutral, meaning the positive charge of the nucleus and the negative
charge of the shell balance out.

Text box prompt: Write down your thoughts on the explanation.

Illustration: Schematic representation of Rutherford’s atomic model (labels: nucleus, atomic shell)

Figure 2. Screenshot of an instructional explanation in Experiment 1 (translated from German). See the online
article for the color version of this figure.

Two  independent  raters  who  were  blind  to  the  hypotheses  and 
conditions  coded  ca.  20%  of  the  text  box  entries.  Interrater  reli¬ 
ability  as  determined  by  Cohen’s  k  was  good  (k  =  .72).  In  case  of 
divergence,  the  coders  reexamined  the  respective  segments  and 
made  a  joint  decision.  As  interrater  reliability  was  good,  only  one 
rater  coded  the  rest  of  the  text  box  entries. 
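
The agreement statistic reported above can be illustrated with a minimal sketch. The codes below are hypothetical and are not the study's data; the function only shows how Cohen's kappa corrects the observed agreement between two raters for the agreement expected by chance.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters who each assigned one category per segment."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Both raters must code the same non-empty set of segments.")
    n = len(rater_a)
    # Observed agreement: share of segments that received identical codes.
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the raters' marginal proportions per category.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    categories = counts_a.keys() | counts_b.keys()
    p_chance = sum(counts_a[c] * counts_b[c] for c in categories) / n**2
    if p_chance == 1:
        raise ValueError("Kappa is undefined when both raters use a single category.")
    return (p_observed - p_chance) / (1 - p_chance)


# Hypothetical codes for ten segments
# (M = monitoring episode, R = repetition, E = elaboration).
rater_1 = ["R", "R", "M", "E", "R", "R", "E", "M", "R", "R"]
rater_2 = ["R", "R", "M", "R", "R", "R", "E", "M", "R", "E"]
kappa = cohens_kappa(rater_1, rater_2)
print(f"kappa = {kappa:.2f}")
```

In this made-up example the raters agree on 8 of 10 segments, but because repetitions dominate both raters' codes, a sizable share of that agreement is expected by chance, which is why kappa is lower than the raw agreement rate.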

Assessment  of  JOLs.  At  the  end  of  each  of  the  three  units  of 
the  computer-based  learning  environment,  the  learners  were  asked 
to  judge  their  level  of  comprehension  of  the  previous  explanations’ 
content on a 6-point rating scale ranging from 0 (inadequate) to 5
(very  good).  As  mentioned  above,  the  instructional  explanations 
that  were  included  in  each  of  the  three  units  were  closely  related  to 
each  other.  Thus,  it  is  possible  that  the  third  explanation  that  was 
provided  in  Unit  1  helped  learners  understand  the  content  of  the 
first or second explanation of Unit 1. Therefore, we decided not to
require  learners  to  judge  their  level  of  comprehension  after  each  of 
the  explanations  but  solely  at  the  end  of  each  unit.  For  the  later 
analyses,  the  three  JOLs  were  averaged  (i.e.,  theoretical  min:  0; 
theoretical  max:  5). 

Posttest:  Assessment  of  conceptual  knowledge.  A  posttest 
assessed  the  learners’  conceptual  knowledge  with  respect  to  the 
concepts and principles that were covered by the instructional
explanations in the computer-based learning environment. The
posttest  included  all  four  pretest  items  as  well  as  eight  additional 
open-ended  questions.  Each  of  the  three  units  of  the  computer- 
based  learning  environment  was  covered  by  four  questions.  For 
instance,  the  learners  were  asked  to  explain  possible  reasons  for 
differences  in  the  ionization  energy  of  the  different  electrons  of  an 
atom  (Unit  3)  or  to  formulate  rules  according  to  which  the  electron 
shells  of  an  atom  are  filled  (Unit  2). 

Based  on  a  scoring  protocol,  the  learners’  answers  were  exam¬ 
ined  for  correct  arguments  for  each  question.  Two  independent 
raters  who  were  blind  to  the  hypotheses  and  conditions  scored  the 
written  answers  of  15  learners.  Interrater  reliability  as  determined 
by  the  intraclass  coefficient  with  measures  of  absolute  agreement 
was  very  good  for  each  of  the  12  questions  (all  ICC  >  .82). 
Therefore,  only  one  rater  scored  the  rest  of  the  written  answers. 
The  raw  scores  were  summed  up  and  transformed  into  percentage 
scores  of  the  theoretical  maximum  number  of  points  that  could 
have  been  achieved  in  the  posttest  (i.e.,  72  points). 

Procedure. The participants worked at a computer in individual
sessions. First, all learners filled in a demographics questionnaire.
Second, all learners took the pretest. Then, depending on the
experimental  condition,  the  learners  either  watched  the  multimedia 
presentation  about  the  dangers  of  making  overconfident  JOLs 
(with  headphones)  or  read  the  expository  text  on  the  diathesis- 
stress-model.  Next,  the  learners  entered  the  computer-based  learn¬ 
ing  environment  and  received  a  short  introduction.  Specifically, 
they  were  told  that  they  could  not  go  back  to  an  explanation  after 
they  had  clicked  on  the  next  page  button  and  were  shown  how  to 
use the text boxes. During the learning phase, the participants
processed  the  instructional  explanations  sequentially  at  their  own 
pace.  At  the  end  of  each  unit,  they  judged  their  level  of  compre¬ 
hension.  Finally,  all  participants  completed  the  posttest.  The  ex¬ 
periment  lasted  approximately  2  hr. 

Results 

Table  1  shows  the  mean  scores  and  SDs  for  the  two  groups  on 
all measures of the study. An α-level of .05 was used for all tests.
For F tests, we report ηp² as the effect size measure, qualifying
values  of  approximately  .01  as  small  effects,  values  of  approxi¬ 
mately  .06  as  medium  effects,  and  values  of  approximately  .14  or 
more  as  large  effects;  for  t  tests,  we  report  Cohen’s  d  qualifying 
values  of  approximately  0.20  as  small  effects,  values  of  approxi¬ 
mately  0.50  as  medium  effects,  and  values  of  approximately  0.80 
or  more  as  large  effects  (see  Cohen,  1988). 
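
As a plausibility check on the effect sizes reported in this section, partial eta squared can be recovered from an F value and its degrees of freedom via the identity ηp² = (F × df_effect) / (F × df_effect + df_error). The sketch below applies it to four F(1, 39) values reported for Experiment 1; the function name and the dictionary labels are our own illustrative choices.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta squared implied by an F statistic and its degrees of freedom."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)


# F(1, 39) values as reported in the Results of Experiment 1.
reported_f = {
    "monitoring episodes": 4.50,
    "judgments of learning": 3.58,
    "posttest score": 4.95,
    "learning time": 3.02,
}

effect_sizes = {
    label: round(partial_eta_squared(f, 1, 39), 2) for label, f in reported_f.items()
}
for label, eta in effect_sizes.items():
    print(f"{label}: partial eta squared = {eta:.2f}")
```

Because the reported F values are themselves rounded, small discrepancies in the second decimal are possible for other tests.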

Preliminary  analyses.  Before  addressing  our  research  ques¬ 
tions,  we  first  tested  whether  there  was  a  significant  difference 
concerning  the  pretest  scores  of  the  two  groups.  A  t  test  did  not 
yield a statistically significant effect, t(31.45) = 0.78, p = .443,
d = 0.24. Hence, there was no significant difference with respect
to the learners’ prior conceptual knowledge of the topic atomic
structure. On average, the learners reached 18.6% (SD = 14.3%)
of  the  theoretical  maximum  score  of  the  pretest,  which  indicates 
that,  as  intended,  they  had  rather  low  prior  conceptual  knowledge 
of  the  topic  atomic  structure.  Nevertheless,  to  reduce  error  vari¬ 
ance,  we  included  the  pretest  score  as  a  covariate  in  all  subsequent 
analyses  (with  the  exception  of  the  analyses  regarding  JOLs;  see 
below).  For  all  analyses,  the  assumption  of  homogeneous  within 
group  regression  slopes  was  not  violated. 

Second, as a type of implementation check, we analyzed the
responses of the learners in the information-about-overconfidence
group to the multiple-choice questions that followed the
multimedia presentation about the dangers of making overconfident
JOLs.  We  found  that  91.6%  of  the  responses  were  correct.  Fur¬ 
thermore,  only  one  participant  responded  incorrectly  to  more  than 
one  of  the  questions  (i.e.,  two  questions).  Jointly,  these  findings 
indicate  that  the  learners  understood  the  main  content  of  the 
presentation.  As  the  learners  received  feedback  whenever  they 
responded  incorrectly  (see  Method  section),  none  of  the  learners 
were  excluded  from  the  further  analyses. 

Monitoring and JOLs. In Research Question 1, we were
interested in whether informing learners about the dangers of
making overconfident JOLs would increase their engagement in
monitoring, which should be reflected in an increase in the number
of detected comprehension difficulties (i.e., monitoring episodes).
The analysis of covariance (ANCOVA; covariate: pretest score)
yielded a statistically significant main effect of condition,
F(1, 39) = 4.50, p = .040, ηp² = .10. The learners in the
information-about-overconfidence group detected more comprehension
difficulties than the learners in the control group.

Table 1

Means (and SDs) of All Measures in Experiment 1

Variable                      Information-about-overconfidence group    Control group
Pretest (%)                   .17 (.10)                                 .20 (.18)
Judgment of learning          2.95 (.95)                                3.22 (1.11)
Posttest (%)                  .52 (.20)                                 .41 (.19)
Monitoring episodes           3.38 (4.05)                               1.28 (1.85)
Repetitions                   22.14 (17.11)                             20.14 (19.74)
Elaborations                  4.43 (4.52)                               3.62 (4.42)
Learning time (in minutes)    45.23 (19.36)                             35.19 (14.19)

In  Research  Question  2,  we  were  interested  in  whether  the  effect 
on  the  number  of  detected  comprehension  difficulties  would,  given 
the  same  level  of  acquired  conceptual  knowledge,  lead  to  the 
informed  learners  making  lower  JOLs  than  the  learners  in  the 
control  group.  To  address  this  research  question,  we  first  analyzed 
whether  the  two  groups  differed  in  the  level  of  JOLs.  To  hold  the 
level  of  acquired  conceptual  knowledge  constant,  the  posttest  score 
was  included  as  a  covariate  in  this  analysis;  the  assumption  of 
homogeneous  within  group  regression  slopes  was  not  violated.  The 
ANCOVA  yielded  a  marginally  significant  effect  of  condition, 
F(1, 39) = 3.58, p = .066, ηp² = .08. The informed learners
showed lower JOLs (estimated M = 2.81, SE = 0.20) than the
uninformed learners (estimated M = 3.36, SE = 0.20).

To  test  whether  this  effect  was  mediated  via  the  effect  on  the 
number  of  detected  comprehension  difficulties,  in  the  next  step  we 
conducted  a  mediation  analysis  using  Preacher  and  Hayes’  (2008) 
bootstrapping  method.  This  method  entails  building  an  empirical 
approximation  of  the  indirect  effect’s  sampling  distribution  by 
repeatedly  resampling  the  data  and  estimating  the  indirect  effect 
thousands  of  times.  In  the  present  study,  we  generated  95%  bias- 
corrected  bootstrap  confidence  intervals  (CIs)  from  5,000  boot¬ 
strap  samples  using  the  SPSS  macro  INDIRECT  (Preacher  & 
Hayes, 2008). The mediation analysis yielded a statistically
significant indirect effect of the number of monitoring episodes,
a × b = -0.191 (LCL = -0.511, UCL = -0.011). This result
suggests that the lowering of the JOLs on the part of the learners in the
information-about-overconfidence  group  was  mediated  via  the 
higher  number  of  comprehension  difficulties  they  had  detected. 
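
The logic of the bootstrapped indirect effect can be sketched as follows. This is a simplified illustration and not the INDIRECT macro: it uses synthetic data, assumes hypothetical group sizes of 21 per condition, omits any covariates, and all function and variable names are our own. The indirect effect is the product of a (the effect of condition on the mediator) and b (the effect of the mediator on the outcome, controlling for condition); the bias-corrected interval shifts the percentile cut-offs according to how far the bootstrap distribution is displaced from the point estimate.

```python
import random
from statistics import NormalDist, mean


def indirect_effect(x, m, y):
    """a * b: a = slope of m on x; b = slope of y on m, controlling for x."""
    mx, mm, my = mean(x), mean(m), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    smm = sum((mi - mm) ** 2 for mi in m)
    sxm = sum((xi - mx) * (mi - mm) for xi, mi in zip(x, m))
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    smy = sum((mi - mm) * (yi - my) for mi, yi in zip(m, y))
    a = sxm / sxx
    b = (smy * sxx - sxm * sxy) / (sxx * smm - sxm**2)
    return a * b


def bc_bootstrap_ci(x, m, y, n_boot=5000, level=0.95, seed=1):
    """Point estimate and bias-corrected bootstrap CI for the indirect effect."""
    rng = random.Random(seed)
    n = len(x)
    estimate = indirect_effect(x, m, y)
    boots = []
    while len(boots) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]  # resample cases with replacement
        try:
            boots.append(
                indirect_effect([x[i] for i in idx], [m[i] for i in idx], [y[i] for i in idx])
            )
        except ZeroDivisionError:
            continue  # degenerate resample (e.g., only one condition drawn); redraw
    boots.sort()
    norm = NormalDist()
    # Bias correction: displacement of the bootstrap distribution from the estimate.
    share_below = sum(value < estimate for value in boots) / n_boot
    share_below = min(max(share_below, 1 / n_boot), 1 - 1 / n_boot)
    z0 = norm.inv_cdf(share_below)
    z_crit = norm.inv_cdf(0.5 + level / 2)
    lower = boots[min(n_boot - 1, int(norm.cdf(2 * z0 - z_crit) * n_boot))]
    upper = boots[min(n_boot - 1, int(norm.cdf(2 * z0 + z_crit) * n_boot))]
    return estimate, lower, upper


# Synthetic data (NOT the study's data): x = condition, m = detected
# comprehension difficulties, y = judgment of learning.
data_rng = random.Random(0)
x = [1] * 21 + [0] * 21
m = [2.0 * xi + data_rng.gauss(0, 1) for xi in x]
y = [3.0 - 0.3 * mi + data_rng.gauss(0, 0.5) for mi in m]
estimate, lower, upper = bc_bootstrap_ci(x, m, y)
print(f"a x b = {estimate:.3f}, 95% BC CI [{lower:.3f}, {upper:.3f}]")
```

As in the analysis reported above, the indirect effect would be regarded as statistically significant if the interval excludes zero.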

Posttest  scores  and  cognitive  processes.  In  Research  Question 
3,  we  were  interested  in  whether  the  learners  in  the  information- 
about-overconfidence  group  would  acquire  more  conceptual 
knowledge  from  the  instructional  explanations  than  their  counter¬ 
parts  in  the  control  group.  With  respect  to  the  posttest  scores,  the 
ANCOVA (covariate: pretest score) showed a statistically significant
effect of condition, F(1, 39) = 4.95, p = .032, ηp² = .11. The
learners  in  the  information-about-overconfidence  group  reached 
higher  scores. 

In  the  next  step,  we  examined  whether  the  effect  on  the  posttest 
scores  was  driven  (a)  by  a  higher  number  of  cognitive  processes 
and/or  (b)  by  a  change  in  the  focus  of  the  cognitive  processes  such 
that  learners  in  the  information-about-overconfidence  group  fo¬ 
cused  more  on  not  yet  well-understood  content  than  the  unin¬ 
formed  learners  (Research  Question  4).  Concerning  the  number  of 
cognitive  processes,  we  did  not  find  a  statistically  significant  effect 
between  the  experimental  conditions  regarding  the  number  of 
content repetitions, F(1, 39) = 0.05, p = .818, ηp² = .00, the
number of elaborations, F(1, 39) = 0.76, p = .387, ηp² = .02, or the
total number of cognitive processes, F(1, 39) = 0.20, p = .660,
ηp² = .00. As for (b), we found that the total number of cognitive
processes  correlated  differently  with  the  posttest  scores  in  the  two 
groups.  In  the  information-about-overconfidence  group  the  num¬ 
ber  of  cognitive  processes  was  positively  correlated  with  the 
posttest scores, r = .53, p = .016, whereas there was no statistically
significant correlation in the control group, r = -.18, p = .439.

For  exploratory  purposes,  we  also  analyzed  whether  the  two 
groups  differed  with  respect  to  the  time  spent  on  the  instructional 
explanations. The ANCOVA did not show a statistically significant
effect of condition, F(1, 39) = 3.02, p = .090, ηp² = .07.
Furthermore, learning time did not significantly correlate with the
posttest scores, r = .20, p = .191. Jointly, these findings suggest
that  the  superiority  of  the  learners  in  the  information-about- 
overconfidence  group  regarding  the  posttest  scores  was  not  simply 
because  of  more  learning  time. 

Discussion 

In  short,  the  findings  of  Experiment  1  suggest  that  for  university 
students,  insufficient  comprehension  monitoring  and  its  concom¬ 
itant  underachievement  is  due  in  part  to  the  fact  that  these  learners 
lack  knowledge  regarding  the  benefits  of  comprehension  monitor¬ 
ing  (i.e.,  conditional  knowledge).  Informing  them  about  the  dan¬ 
gers  of  making  overconfident  JOLs  fostered  the  learners’  engage¬ 
ment  in  comprehension  monitoring  as  well  as  the  amount  of 
conceptual  knowledge  they  acquired  in  a  subsequent  learning 
phase. 

We  found  that  while  processing  the  instructional  explanations 
relating  to  the  topic  of  atomic  structure,  the  learners  in  the 
information-about-overconfidence  group  detected  more  compre¬ 
hension  difficulties  than  the  learners  in  the  control  group  (Re¬ 
search  Question  1).  Furthermore,  given  the  same  level  of  acquired 
conceptual  knowledge,  the  learners  in  the  information-about- 
overconfidence  group  made  lower  (i.e.,  more  cautious)  JOLs  than 
their  counterparts.  Notably,  these  lower  JOLs  were  not  simply 
because  of  the  learners  in  the  information-about-overconfidence 
group  shallowly  lowering  their  JOLs.  Rather,  the  effect  was  driven 
via  the  higher  number  of  detected  comprehension  difficulties  (Re¬ 
search  Question  2).  One  possible  explanation  for  this  pattern  of 
results  is  that  the  information  about  the  dangers  of  making  over¬ 
confident  JOLs  increased  the  perceived  utility  value  of  compre¬ 
hension  monitoring.  That  is,  being  aware  of  the  dangers  of  making 
overconfident  JOLs  served  as  a  persuasive  argument  why  learners 
should  engage  in  comprehension  monitoring  when  dealing  with 
new  content.  Consequently,  it  fostered  the  learners’  engagement 
therein,  which  led  to  a  higher  number  of  detected  comprehension 
difficulties.  These  detected  difficulties,  in  turn,  led  to  lower  JOLs. 

However,  it  should  be  highlighted  that  even  in  the  information- 
about-overconfidence  group  the  total  number  of  detected  compre¬ 
hension  difficulties  was  relatively  low.  On  average,  the  learners  in 
the  informed  group  detected  3.38  comprehension  difficulties  while 
processing  the  instructional  explanations.  In  conjunction  with  the 
relatively  low  average  posttest  scores  (i.e.,  ca.  52%),  this  finding 
suggests  that  even  the  learners  in  the  informed  group  failed  to 
detect  all  of  their  comprehension  difficulties.  One  explanation  for 
this  finding  is  that  to  monitor  their  comprehension,  the  learners 
utilized  mainly  memory-based  instead  of  comprehension-based 
cues (see Thiede et al., 2010). As a consequence, the learners
might  have  missed  several  (deep)  comprehension  difficulties, 
which resulted in relatively low numbers of detected comprehension
difficulties and relatively low posttest performance.

Although the absolute number of detected comprehension difficulties
was relatively low, our findings nevertheless suggest that
the informed learners converted their superiority in terms of the
number  of  detected  comprehension  difficulties  into  beneficial  reg¬ 
ulation  decisions.  They  acquired  more  conceptual  knowledge  from 
the  instructional  explanations  than  the  learners  in  the  control  group 
(Research Question 3). The lack of effect regarding learning time
and the number of cognitive processes suggests that these regulation
decisions did not simply entail that the learners increased
the amount of effort they invested in processing the
instructional  explanations.  Instead,  the  finding  that  the  number 
of  cognitive  processes  was  only  positively  correlated  with  the 
posttest  scores  in  the  information-about-overconfidence  group 
indicates  that  the  effect  on  the  acquisition  of  conceptual  knowl¬ 
edge  might  be  because  of  a  shift  in  the  focus  of  the  informed 
learners’  cognitive  processes  (Research  Question  4).  Because 
they  detected  more  comprehension  difficulties,  the  learners  in 
the  information-about-overconfidence  group  might  have  been 
better  able  to  focus  on  content  they  had  not  yet  fully  grasped 
than  the  learners  in  the  control  group.  Focusing  on  content  that 
is  not  yet  well  understood  should  be  beneficial  in  terms  of  the 
acquisition  of  conceptual  knowledge;  thus,  the  number  of  cog¬ 
nitive  processes  was  positively  correlated  with  the  posttest 
scores  in  the  information-about-overconfidence  group.  By  con¬ 
trast,  focusing  on  already  well-understood  content  might  not 
necessarily  entail  an  added  value;  thus,  the  number  of  cognitive 
processes  was  not  systematically  correlated  with  the  posttest 
scores  in  the  control  group.  However,  as  the  number  of  partic¬ 
ipants  in  the  two  groups  was  relatively  low,  this  interpretation 
of  the  within-group  correlations  should  be  treated  cautiously. 

Experiment  2 

The  main  goal  of  Experiment  2  was  to  investigate  whether 
providing  information  about  the  dangers  of  making  overconfident 
JOLs  would  have  similar  effects  for  13-  to  15-year-old  high  school 
students.  We  thought  this  age  group  would  be  suitable  for  analysis 
because,  as  previously  mentioned,  students  generally  develop  the 
basic  skills  required  to  monitor  text  comprehension  by  roughly  the 
age  of  12  (see  de  Bruin  &  Van  Gog,  2012).  Furthermore,  inter¬ 
ventions  that  have  been  found  to  foster  adult  learners’  comprehen¬ 
sion  monitoring  have  also  yielded  promising  results  when  given  to 
younger students (e.g., De Bruin et al., 2011; Redford et al., 2012).
Therefore,  it  can  be  expected  that  informing  13-  to  15-year-old 
high  school  students  about  the  dangers  of  making  overconfident 
JOLs  would  foster  their  engagement  in  comprehension  monitoring 
as well. However, the ability to use information regarding one’s
comprehension  difficulties  to  effectively  regulate  subsequent 
learning  seems  to  develop  later  than  the  ability  to  monitor  one’s 
own  comprehension  (e.g.,  De  Bruin  et  al.,  2011).  Hence,  learners 
in this age group might not be able to make beneficial regulation
decisions  and  benefit  from  being  informed  about  the  dangers  of 
making  overconfident  JOLs  in  terms  of  learning  outcomes  (i.e.,  the 
acquisition  of  conceptual  knowledge). 

Method 

Sample  and  design.  The  participants  of  this  experiment  were 
N = 74 eighth-grade students from different classes of a German
high-track  high  school  (49  girls,  25  boys;  mean  age  =  13.51 
years).  The  sample  size  was  determined  by  an  analysis  of  power 
with the following parameters: α = .05; power = .80; d = 0.66
(i.e.,  the  effect  size  regarding  comprehension  monitoring  of  Ex¬ 
periment  1),  one-tailed.  This  analysis  suggested  including  at  least 
30  participants  per  condition.  The  parents  of  all  students  gave 
consent  for  their  participation.  Because  of  technical  problems,  the 
data  of  three  participants  were  rendered  unusable.  Therefore,  com¬ 
plete  data  sets  were  available  for  N  =  71  students. 
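
The reported minimum of 30 participants per condition can be approximated with a short calculation. The sketch below is not necessarily the procedure the power analysis above was run with: it uses the normal approximation for a two-group comparison plus a common small-sample correction term (z_alpha² / 4) to account for using t rather than z. The function name and defaults are our own.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(d, alpha=0.05, power=0.80, one_tailed=True):
    """Approximate per-group sample size for a two-group t test on effect size d."""
    norm = NormalDist()
    z_alpha = norm.inv_cdf(1 - alpha if one_tailed else 1 - alpha / 2)
    z_power = norm.inv_cdf(power)
    # Normal-approximation sample size for two independent groups of equal size.
    n_normal = 2 * ((z_alpha + z_power) / d) ** 2
    # Small-sample correction for estimating the variance (t instead of z).
    return ceil(n_normal + z_alpha**2 / 4)


# Parameters as stated in the text: alpha = .05, power = .80, d = 0.66, one-tailed.
required = n_per_group(0.66)
print(f"required participants per condition: {required}")
```

Under these parameters the approximation yields 30 participants per condition, which matches the figure given in the text; dedicated power-analysis software based on the noncentral t distribution may differ by a participant or so for other inputs.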

We  randomly  assigned  the  participants  to  two  conditions.  As  in 
Experiment  1,  the  learners  received  either  information  about  the 
dangers of making overconfident JOLs (information-about-overconfidence
condition; n = 35) or no such information (control
condition; n = 36) before they received written instructional
explanations  relating  to  the  topic  of  atomic  structure. 

Materials. 

Information  about  the  dangers  of  making  overconfident  JOLs. 
We  used  a  slightly  modified  version  of  the  multimedia  presenta¬ 
tion  used  in  Experiment  1.  To  adapt  the  presentation  to  the  age 
level  of  the  sample,  we  simplified  the  language  and  refrained  from 
using  foreign  words  or  technical  terms.  The  presentation  of  the 
series  of  empirical  studies  by  Dunning  et  al.  (2003)  was  shortened 
and  the  learners  were  not  prompted  to  describe  and  interpret  the 
main  content  of  the  diagram  that  depicted  the  main  results  of  these 
studies  because  it  might  have  overloaded  them.  We  also  adapted 
the  model  learner  who  was  presented  to  highlight  the  possible 
detrimental  consequences  of  making  overconfident  JOLs.  Specif¬ 
ically,  rather  than  reading  an  expository  text  on  the  diathesis- 
stress-model,  the  model  learner  read  an  expository  text  on  basic 
rules  of  English  grammar  that  was  designed  for  5th-grade  students. 

Except for these few modifications, the content of the multimedia
presentation was essentially the same as in Experiment 1. The
presentation  lasted  approximately  9  min.  Afterward,  the  partici¬ 
pants  answered  three  of  the  four  multiple-choice  questions  that 
were  used  in  Experiment  1  to  assess  whether  they  had  understood 
the  main  content  of  the  presentation.  The  question  that  was  left  out 
related  to  the  empirical  studies  of  Dunning  et  al.,  which  were  only 
presented  very  briefly  in  the  modified  presentation.  In  case  the 
learners  failed  to  correctly  respond  to  any  of  these  questions,  they 
received  immediate  feedback  that  explained  the  correct  answer  on 
the  following  screen. 

Video about optical illusions. Because Experiment 2 was conducted
in group sessions (the learners still worked individually at
a computer; see Procedure section), learners might have been
confused if not all of them needed headphones. Therefore, in
contrast to Experiment 1, the learners in the control condition
watched a video instead of reading an expository text before they
processed the instructional explanations. The video was an excerpt
from a YouTube video about optical illusions and lasted approximately
9 min. Afterward, the participants answered three multiple-choice
questions that were designed to assess whether they had
understood the main content of the video.

Computer-based learning environment: Written instructional
explanations related to the topic atomic structure. In the learning
phase, all learners worked in a computer-based learning environment
that consisted of written instructional explanations relating
to the topic of atomic structure. As some of the principles and
concepts that were explained in Experiment 1 would have been too
difficult for this age group, the learners in Experiment 2 received
only one unit of instructional explanations, which dealt with four
basic principles and concepts relating to atomic structure (i.e., four
instructional explanations). Furthermore, the experiment was carefully
timed in coordination with the participants’ chemistry teachers
to ensure that the content of the instructional explanations was
not  only  new  but  also  not  too  difficult  for  the  learners.  The  written 
explanations  were  partly  accompanied  by  graphics  that  are  gener¬ 
ally  used  in  8th-grade  chemistry  textbooks  in  Germany.  Each 
instructional  explanation  was  provided  on  one  page  in  the 
computer-based  learning  environment.  For  the  purpose  of  gather¬ 
ing  data  on  the  learning  processes  in  which  the  learners  engaged 
while  processing  the  instructional  explanations,  each  explanation 
was  accompanied  by  the  same  engaging  prompt  as  the  one  used  in 
Experiment  1  (i.e.,  “Write  down  your  thoughts  on  the  explana¬ 
tion”).  All  learners  were  required  to  type  their  responses  to  these 
prompts  into  text  boxes  that  were  placed  next  to  the  instructional 
explanations. 

Instruments  and  Measures. 

Pretest:  Assessment  of  prior  conceptual  knowledge.  A  pretest 
using  four  open-ended  questions  assessed  the  learners’  prior  con¬ 
ceptual  knowledge  on  the  topic  of  atomic  structure  (e.g.,  “Explain 
the  structure  of  an  atom’s  nucleus  according  to  the  Rutherford 
model”).  Each  instructional  explanation  of  the  computer-based 
learning  environment  was  covered  by  one  pretest  question.  The 
learners’  answers  were  examined  for  correct  arguments  for  each  of 
the  questions  based  on  a  scoring  protocol.  Two  independent  raters 
who  were  blind  to  the  hypotheses  and  conditions  scored  the  written 
answers  of  20  learners.  Interrater  reliability  as  determined  by  the 
intraclass  coefficient  with  measures  of  absolute  agreement  was 
very  good  for  each  of  the  four  questions  (all  ICC  >  .85)  so  only 
one  rater  scored  the  rest  of  the  written  answers.  The  raw  scores 
were  summed  up  and  transformed  into  percentage  scores  of  the 
theoretical  maximum  number  of  points  that  could  have  been 
achieved  in  the  pretest  (i.e.,  15  points). 
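
As a small illustration of the score transformation described above, the following Python sketch converts a summed raw score into a percentage of the theoretical maximum. The function name and the example learner are illustrative assumptions; only the 15-point maximum comes from the text.

```python
def percentage_score(raw_points: float, max_points: float) -> float:
    """Return a raw score as a percentage of the theoretical maximum."""
    if max_points <= 0:
        raise ValueError("max_points must be positive")
    return 100.0 * raw_points / max_points


# Hypothetical learner who earned 1.5 of the 15 possible pretest points.
print(percentage_score(1.5, 15))  # 10.0
```

The same function would apply to the posttest by passing that test's maximum number of points.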

Assessment  of  learning  processes.  Using  the  coding  scheme 
from Experiment 1, the learners’ text box entries were examined
for  monitoring  episodes  (i.e.,  detected  comprehension  difficulties) 
as  well  as  the  cognitive  processes  of  repeating  and  elaborating  on 
the  content  of  the  instructional  explanations.  Two  independent 
raters  who  were  blind  to  the  hypotheses  and  conditions  coded  the 
text  box  entries  of  20  learners.  Interrater  reliability  as  determined 
by Cohen’s κ was good (κ = .75). In case of divergence, the
coders  reexamined  the  respective  segments  and  made  a  joint 
decision.  As  interrater  reliability  was  good,  only  one  rater  coded 
the  rest  of  the  text  box  entries. 
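
To make the reliability index concrete, the following sketch computes Cohen's kappa for two raters from first principles (observed agreement corrected for chance agreement). The category labels and the eight coded segments are hypothetical and are not data from the study.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters who coded the same segments."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must code the same non-empty set of segments")
    n = len(rater_a)
    # Observed agreement: share of segments given identical codes.
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the raters' marginal proportions per category.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    p_chance = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if p_chance == 1.0:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (p_observed - p_chance) / (1.0 - p_chance)


# Hypothetical codes for eight segments (not data from the study).
codes_a = ["monitoring", "repetition", "repetition", "elaboration",
           "repetition", "monitoring", "elaboration", "repetition"]
codes_b = ["monitoring", "repetition", "elaboration", "elaboration",
           "repetition", "repetition", "elaboration", "repetition"]
print(round(cohens_kappa(codes_a, codes_b), 2))
```

Because kappa subtracts the agreement expected by chance, it is lower than the raw percentage of matching codes whenever some categories are used far more often than others.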

Assessment  of  JOLs.  The  four  instructional  explanations  of 
the  computer-based  learning  environment  did  not  directly  relate  to 
each  other  such  that  the  succeeding  instructional  explanations 
could  help  learners  understand  the  respective  prior  instructional 
explanations.  Therefore,  in  contrast  to  Experiment  1  the  learners 
were  asked  to  judge  their  level  of  comprehension  after  each  of  the 
four  instructional  explanations.  The  item  was  the  same  as  the  one 
used  in  Experiment  1.  For  the  later  analyses,  the  four  JOLs  were 
averaged  (i.e.,  theoretical  min:  0;  theoretical  max:  5). 

Posttest:  Assessment  of  conceptual  knowledge.  A  posttest 
assessed  the  learners’  conceptual  knowledge  with  respect  to  the 
concepts  and  principles  that  were  covered  by  the  instructional 
explanations  in  the  computer-based  learning  environment.  The 
posttest  included  all  four  items  of  the  pretest  as  well  as  six  further 
open-ended  questions.  For  instance,  the  learners  were  asked  to 
explain the structure of the atomic shell or why argon (18 protons)
and calcium (20 protons) nearly have the same mass numbers (ca.
40).  Using  a  scoring  protocol,  the  learners’  answers  were  exam¬ 
ined  for  correct  arguments  for  each  question.  Two  independent 
raters  who  were  blind  to  the  hypotheses  and  conditions  scored  the 
written  answers  of  20  learners.  Interrater  reliability  as  determined 
by  the  intraclass  coefficient  with  measures  of  absolute  agreement 
was  very  good  for  each  of  the  10  questions  (all  ICC  >  .85).  Thus, 
only  one  rater  scored  the  rest  of  the  written  answers.  The  raw 
scores  were  summed  up  and  transformed  into  percentage  scores  of 
the  theoretical  maximum  number  of  points  that  could  have  been 
achieved  in  the  posttest  (i.e.,  40  points). 

Procedure.  The  participants  worked  individually  at  a  com¬ 
puter  with  headphones  in  group  sessions.  First,  all  learners  took  the 
pretest.  After  that,  the  learners  watched  either  the  multimedia 
presentation  about  the  dangers  of  making  overconfident  JOLs  or 
the  video  about  optical  illusions  depending  on  their  experimental 
condition.  Third,  the  learners  entered  the  computer-based  learning 
environment.  At  the  beginning,  all  learners  were  given  a  short 
introduction that was similar to the version used in Experiment 1.
During the learning phase, the participants processed the instructional
explanations sequentially at their own pace and judged their
level of comprehension after each explanation. After they had
completed the learning environment, all participants took the posttest.
In the final step, they filled in a demographics questionnaire.
The experiment lasted approximately 1.5 hr.

Results 

Table  2  shows  the  mean  scores  and  standard  deviations  for  the 
two groups on all measures of the study. An α-level of .05 was
used  for  all  tests. 

Preliminary analyses. Before addressing our research questions,
first we tested whether the groups differed with respect to
their pretest scores. Surprisingly, a t test yielded a statistically
significant effect, t(54.63) = 2.99, p = .004, d = 0.71. The
learners in the control group reached higher scores than their
counterparts in the information-about-overconfidence group. However,
the average score was 2.9% (SD = 4.2%) and the difference
in  the  pretest  scores  between  the  two  groups  was  only  ca.  2%  (i.e., 
0.3  points  on  the  pretest);  therefore,  this  difference  is  probably  not 
pedagogically  relevant.  Rather,  it  might  be  a  result  of  very  low 
variances  within  the  groups,  which,  in  turn,  can  be  attributed  to  the 
fact  that  most  of  the  learners  (ca.  55%)  scored  zero  and  none  of 
them  scored  higher  than  17%  on  the  pretest.  To  control  for  the 
difference  in  pretest  scores,  we  nonetheless  included  the  pretest 
score as a covariate in all subsequent analyses (with the exception
of the analyses regarding JOLs; see below). The assumption of
homogeneous within-group regression slopes was not violated in
any of the analyses.

Table 2

Means (and SDs) of All Measures in Experiment 2

Variable                      Information-about-         Control group
                              overconfidence group
Pretest (%)                   .01 (.03)                  .04 (.05)
Judgment of learning          2.32 (.84)                 2.88 (.80)
Posttest (%)                  .23 (.15)                  .24 (.15)
Monitoring episodes           1.29 (1.82)                .47 (.74)
Repetitions                   7.89 (8.63)                9.75 (9.78)
Elaborations                  1.00 (1.31)                .69 (1.85)
Learning time (in minutes)    18.21 (10.04)              20.03 (11.69)
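
The fractional degrees of freedom in the reported pretest comparison (df = 54.63) are what a Welch (unequal-variance) t test produces. The section does not name the test, so treating it as Welch's test is an assumption. The sketch below computes the Welch statistic and the Welch-Satterthwaite degrees of freedom from group summaries, using the rounded pretest means, SDs, and group sizes reported for Experiment 2. Because those inputs are rounded to two decimals, the result should land near, but not exactly on, the published values.

```python
import math


def welch_t_from_summary(m1, sd1, n1, m2, sd2, n2):
    """Welch's t statistic and Welch-Satterthwaite df from group summaries."""
    v1, v2 = sd1 ** 2 / n1, sd2 ** 2 / n2  # squared standard errors per group
    t = (m1 - m2) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df


# Rounded pretest values from Table 2: control group vs. informed group.
t, df = welch_t_from_summary(0.04, 0.05, 36, 0.01, 0.03, 35)
print(f"t({df:.2f}) = {t:.2f}")
```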

Second, as a type of implementation check, we analyzed the
responses of the learners in the information-about-overconfidence
group to the multiple-choice questions that followed the multimedia
presentation. We found that 96.2% of the responses were
correct. Furthermore, none of the participants responded incorrectly
to more than one of the questions. Jointly, these findings
indicate  that  the  learners  understood  the  main  content  of  the 
presentation.  None  of  the  learners  were  excluded  from  the  further 
analyses. 

Monitoring  and  JOLs.  In  Research  Question  1,  we  were 
interested  in  whether  informing  learners  about  the  dangers  of 
making  overconfident  JOLs  would  increase  their  engagement  in 
comprehension monitoring. An ANCOVA indicated a statistically
significant effect, F(1, 68) = 6.75, p = .011, ηp² = .09. The
learners  in  the  information-about-overconfidence  group  detected  a 
higher  number  of  comprehension  difficulties  than  the  uninformed 
learners. 
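
The reported effect sizes can be cross-checked against the F ratios with the standard identity for partial eta squared, ηp² = (F × df_effect) / (F × df_effect + df_error). The following sketch applies it to the monitoring effect; the function name is an illustrative choice.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta squared recovered from an F ratio and its degrees of freedom."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)


# Monitoring effect reported above: F(1, 68) = 6.75.
print(round(partial_eta_squared(6.75, 1, 68), 2))  # 0.09
```

The same identity can be used to verify the other effect sizes in this section from their F values and degrees of freedom.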

In  Research  Question  2,  we  were  interested  in  whether  the  effect 
on  the  number  of  monitoring  episodes  would,  given  the  same  level 
of acquired conceptual knowledge, lead the informed learners to
make  lower  JOLs  than  the  learners  in  the  control  group.  To  address 
this  research  question,  in  the  first  step  we  analyzed  whether  the 
two  groups  differed  in  their  JOLs.  As  in  Experiment  1,  the  posttest 
score  was  included  as  a  covariate  in  this  analysis;  the  assumption 
of  homogeneous  within  group  regression  slopes  was  not  violated. 
The ANCOVA yielded a statistically significant effect of condition,
F(1, 68) = 9.48, p = .003, ηp² = .12. The informed learners
showed lower JOLs (M_estimated = 2.33; SE = 0.13) than the
uninformed learners (M_estimated = 2.87; SE = 0.13).

In the next step, we analyzed whether this effect was mediated
through the effect on the number of detected comprehension difficulties
(i.e., monitoring episodes). The mediation analysis revealed a
statistically significant indirect effect of the number of monitoring
episodes, a × b = -0.131 (LCL = -0.291, UCL = -0.010).
Taken together, these results suggest that the lower JOLs on the part
of the informed learners were mediated via the higher number of
comprehension difficulties they had detected in the learning phase.
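
The logic of an indirect effect with lower and upper confidence limits can be sketched as follows. This section does not state how the limits were estimated, so the percentile bootstrap used here is an assumption (it is one common choice), the covariate is omitted for brevity, and the data are simulated rather than taken from the study. Here a is the slope of the mediator (monitoring episodes) on condition, and b is the slope of the outcome (JOL) on the mediator with condition held constant.

```python
import random
from statistics import mean


def _cov(u, v):
    """Sample covariance of two equally long sequences."""
    mu, mv = mean(u), mean(v)
    return sum((p - mu) * (q - mv) for p, q in zip(u, v)) / (len(u) - 1)


def indirect_effect(x, m, y):
    """a*b, where a is the slope of M on X and b the slope of Y on M given X."""
    sxx, smm, sxm = _cov(x, x), _cov(m, m), _cov(x, m)
    sxy, smy = _cov(x, y), _cov(m, y)
    a = sxm / sxx
    b = (smy * sxx - sxy * sxm) / (sxx * smm - sxm ** 2)
    return a * b


def bootstrap_ci(x, m, y, n_boot=2000, alpha=0.05, seed=1):
    """Percentile bootstrap confidence limits for the indirect effect."""
    rng = random.Random(seed)
    n = len(x)
    estimates = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        xs = [x[i] for i in idx]
        # Skip degenerate resamples in which only one condition was drawn.
        if len(set(xs)) < 2:
            continue
        estimates.append(indirect_effect(xs, [m[i] for i in idx], [y[i] for i in idx]))
    estimates.sort()
    lower = estimates[int((alpha / 2) * len(estimates))]
    upper = estimates[int((1 - alpha / 2) * len(estimates)) - 1]
    return lower, upper


# Simulated data only: condition (0/1) raises monitoring, monitoring lowers JOLs.
rng = random.Random(42)
condition = [i % 2 for i in range(70)]
monitoring = [1.5 * c + rng.gauss(0, 1) for c in condition]
jol = [3.0 - 0.4 * mo + rng.gauss(0, 0.5) for mo in monitoring]
print(indirect_effect(condition, monitoring, jol),
      bootstrap_ci(condition, monitoring, jol))
```

An indirect effect is conventionally treated as statistically significant when the interval between the lower and upper limits excludes zero, which is the pattern reported above.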

Posttest  scores  and  cognitive  processes.  In  Research  Ques¬ 
tion  3,  we  were  interested  in  whether  the  learners  in  the 
information-about-overconfidence  group  would  acquire  more  con¬ 
ceptual  knowledge  from  the  instructional  explanations  than  their 
counterparts in the control group. With respect to the posttest
scores, the ANCOVA did not yield a statistically significant effect
of condition, F(1, 68) = 0.18, p = .673, ηp² = .00.

Despite  this  lack  of  effect  regarding  the  posttest  scores,  in  the 
next  step  we  examined  whether  the  two  groups  differed  (a)  in  the 
number  and/or  (b)  the  focus  of  their  cognitive  processes  (Research 
Question 4). Concerning (a) the number of cognitive processes, we
did not find a statistically significant effect between the experimental
conditions regarding the number of content repetitions,
F(1, 68) = 0.83, p = .366, ηp² = .01, the number of elaborations,
F(1, 68) = 0.98, p = .326, ηp² = .01, or the total number of
cognitive processes, F(1, 68) = 0.60, p = .442, ηp² = .01.
Furthermore, with respect to (b) the focus of the learners’ cognitive
processes, there were no statistically significant correlations
between posttest scores and the total number of cognitive
processes, r = .22, p = .220, and r = .27, p = .110, the number
of content repetitions, r = .14, p = .438, and r = .16, p = .350,
or the number of elaborations, r = .19, p = .285, and r = .23, p =
.176, in either the information-about-overconfidence group or the
control group.

For  exploratory  purposes,  we  also  analyzed  whether  the  two 
groups  differed  in  the  time  spent  working  on  the  instructional 
explanations. The ANCOVA did not show a statistically significant
effect of condition, F(1, 68) = 0.60, p = .441, ηp² = .01.

Discussion 

In  short,  the  findings  of  Experiment  2  suggest  that,  just  like  for 
university  students,  for  13-  to  15-year-old  high  school  students, 
insufficient  comprehension  monitoring  can  be  partly  attributed  to 
the  fact  that  these  learners  lack  conditional  knowledge  concerning 
comprehension monitoring. However, unlike for university students,
the provision of this conditional knowledge is not sufficient
to reduce underachievement for 13- to 15-year-old high school
students.

As  was  the  case  for  university  students  (Experiment  1),  the  data 
show  that  raising  learner  awareness  of  the  high  frequency  and 
negative  consequences  of  making  overconfident  JOLs  fostered  the 
number  of  detected  comprehension  difficulties  in  a  subsequent 
learning  phase  (Research  Question  1).  This  effect  on  the  number  of 
detected  comprehension  difficulties,  in  turn,  caused  a  lowering  of 
the  informed  learners’  JOLs  (Research  Question  2).  Jointly,  these 
results suggest that informing 13- to 15-year-old high school
students  about  the  dangers  of  making  overconfident  JOLs  is  a 
viable  means  to  foster  comprehension  monitoring. 

However,  in  contrast  to  Experiment  1,  the  learners  in  the 
information-about-overconfidence  group  did  not  outperform  their 
counterparts  in  terms  of  the  acquisition  of  conceptual  knowledge 
from  the  instructional  explanations  (Research  Question  3).  One 
explanation  for  the  lack  of  effect  on  the  acquisition  of  conceptual 
knowledge  could  be  that  even  though  the  informed  learners  de¬ 
tected  more  comprehension  difficulties  than  the  uninformed  learn¬ 
ers,  the  average  number  of  detected  comprehension  difficulties 
was  too  low.  On  average  they  merely  detected  1.28  comprehension 
difficulties  while  processing  the  four  instructional  explanations. 
Remedying  this  low  number  of  comprehension  difficulties  might 
have  been  insufficient  to  yield  a  significant  advantage  on  the 
posttest. 

Similar  to  Experiment  1,  one  possible  explanation  for  the  low 
number  of  detected  comprehension  difficulties,  in  turn,  could  be 
that  the  learners  mainly  utilized  memory-based  cues  (see  Thiede  et 
al.,  2010).  Additionally,  the  rather  high  level  of  difficulty  of  the 
instructional  explanations  might  have  hindered  the  high  school 
students  from  detecting  comprehension  difficulties  as  well.  Al¬ 
though  the  instructional  explanations  were  designed  in  cooperation 
with  the  learners’  chemistry  teachers  (see  Method  section),  the 
learners’  posttest  scores  hardly  exceeded  20%.  Thus,  processing 
the  explanations  likely  entailed  a  high  level  of  cognitive  load  on 
the  learners’  working  memory.  As  the  students  at  this  age  were 
probably  not  very  experienced  in  monitoring  their  comprehension 
on  complex  tasks  because  this  skill  is  usually  developed  around  the 
age  of  12  (see  de  Bruin  &  Van  Gog,  2012),  it  is  reasonable  to 
assume  that  engaging  in  comprehension  monitoring  also  places 
high demands on their working memory capacity. Consequently,
given that working memory capacity is limited (Baddeley, 1986),
the learners might have become overloaded at some point and were
not able to invest high levels of effort in comprehension monitoring
even if they might have been willing to do so.

However, the lack of effect regarding the acquisition of conceptual
knowledge could also, at least in part, be due to the fact
that the high school students were unable to use the information
regarding their comprehension difficulties to make beneficial
regulation decisions. In line with this explanation, our findings
suggest that there was no significant difference between the groups
with  respect  to  learning  time,  the  number  of  cognitive  processes,  or 
the  focus  of  the  cognitive  processes  (Research  Question  4;  similar 
to  Experiment  1,  the  within-group  correlations  should  be  inter¬ 
preted  cautiously  because  of  the  relatively  low  number  of  partic¬ 
ipants).  Hence,  although  the  informed  learners  were  more  aware  of 
their  comprehension  difficulties,  they  did  not  effectively  adapt 
their  cognitive  processes  to  overcome  them. 

It  is  important  to  note  that  this  interpretation  should  not  suggest 
that  the  learners  did  not  use  the  knowledge  regarding  their  com¬ 
prehension  difficulties  to  regulate  their  cognitive  processes  at  all. 
Instead,  the  cognitive  processes  they  chose  to  engage  in  might 
have  been  insufficient  to  remedy  the  respective  difficulties.  For 
instance, in comparison to Experiment 1, the number of elaborations
per monitoring episode in the information-about-overconfidence
group was relatively low in Experiment 2 (Experiment 1 = 1.31;
Experiment 2 = 0.78). Therefore, in line with the
notion that the ability to beneficially regulate learning develops
later than the ability to monitor one’s comprehension (De Bruin et
al., 2011), it is possible that the lack of effect on the acquisition of
conceptual  knowledge  could  be  the  result  of  a  lack  of  awareness  of 
effective  regulative  strategies,  particularly  ones  that  entail  deep- 
oriented  cognitive  processes  (such  as  generating  elaborations).  If 
this  is  the  case,  then  informing  13-  to  15-year-old  high  school 
students  about  the  dangers  of  making  overconfident  JOLs  should 
be  more  beneficial  in  terms  of  the  acquisition  of  conceptual  knowl¬ 
edge  if  they  also  receive  information  about  effective  (deep- 
oriented)  regulation  strategies. 

Experiment  3 

In  view  of  the  results  of  Experiment  2,  the  main  goal  of 
Experiment  3  was  to  examine  whether  informing  13-  to  15-year- 
old  high  school  students  about  the  dangers  of  making  overconfi¬ 
dent  JOLs  actually  fails  to  foster  the  acquisition  of  conceptual 
knowledge  in  a  subsequent  learning  phase  because  the  learners  are 
insufficiently  aware  of  effective  regulation  strategies.  To  address 
this  question,  we  varied  whether  learners  were  informed  (a)  about 
the  dangers  of  making  overconfident  JOLs  and  (b)  about  effective 
regulation strategies before they processed instructional explanations
that introduced them to the topic of atomic structure. We
focused on the same dependent variables as in Experiments 1 and
2, but modified Research Questions 3 and 4. In Research Question
3, we were particularly interested in whether the acquisition of
conceptual knowledge from the instructional explanations would
be fostered by the combination of informing learners about both the
dangers of making overconfident JOLs and effective regulation
strategies. Correspondingly, in Research Question 4 we explored
whether the potentially superior results of learners who are informed
about both topics would be because of an increase in the number
and/or a change in the focus of their cognitive processes.

Method 

Sample  and  design.  The  participants  of  this  experiment  were 
N  =  97  eighth-grade  students  from  different  classes  of  a  German 
high-track  high  school  (51  girls,  46  boys;  mean  age  =  13.53 
years). The sample size was determined by an analysis of power
with the following parameters: α = .05; power = .80; ηp² = .09
(i.e., the effect size regarding monitoring from Experiment 2). This
analysis  suggested  including  at  least  21  participants  per  condition. 
The  parents  of  all  students  gave  consent  for  their  participation. 
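
The power analysis can be approximated with a short sketch. It converts the partial eta squared of .09 into Cohen's f² and then searches for the smallest group size that reaches the target power for a one-degree-of-freedom effect in a four-group design. Two caveats apply: the power function is a large-sample normal approximation (an assumption made here to stay within the Python standard library), not the exact noncentral F computation that dedicated power software performs, and it ignores the covariate. Because it treats the error degrees of freedom as infinite, it is slightly optimistic and may come out about one participant per condition below an exact calculation.

```python
import math
from statistics import NormalDist


def cohens_f_squared(partial_eta_sq):
    """Convert partial eta squared to Cohen's f squared."""
    return partial_eta_sq / (1.0 - partial_eta_sq)


def approx_power(n_total, f_squared, alpha=0.05):
    """Large-sample power for a 1-df effect (normal approximation)."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    shift = math.sqrt(f_squared * n_total)  # square root of the noncentrality
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)


def min_n_per_condition(partial_eta_sq, n_conditions=4, target=0.80, alpha=0.05):
    """Smallest equal group size whose approximate power reaches the target."""
    f2 = cohens_f_squared(partial_eta_sq)
    n = 2
    while approx_power(n * n_conditions, f2, alpha) < target:
        n += 1
    return n


print(min_n_per_condition(0.09))
```

Under these assumptions the sketch yields a group size close to the minimum of 21 participants per condition reported in the text.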

We randomly assigned the participants to one condition of a 2 ×
2 factorial between-subjects design with the factors information
about the dangers of making overconfident JOLs (with vs. without)
and information about regulation strategies (with vs. without).
Accordingly, before they received written instructional explanations
relating to the topic of atomic structure, the learners received
either (a) information about the dangers of making overconfident
JOLs (i.e., information-about-overconfidence condition; n = 25)
or (b) information about regulation strategies (i.e.,
information-about-regulation-strategies condition; n = 25) or (c)
information about both topics (i.e.,
information-about-overconfidence-and-regulation-strategies condition;
n = 22) or (d) no such information (i.e., control condition; n = 25).

Materials. 

Information  about  the  dangers  of  making  overconfident  JOLs. 
We  used  the  same  multimedia  presentation  as  in  Experiment  2. 

Video  about  optical  illusions.  The  learners  in  the  conditions 
without  information  about  the  dangers  of  making  overconfident 
JOLs  watched  the  same  video  about  optical  illusions  and  took  the 
same  multiple-choice  questions  as  the  learners  in  the  control 
condition  of  Experiment  2. 

Information about regulation strategies. We designed a multimedia
presentation to inform the participants in the
information-about-regulation-strategies conditions about effective
strategies they could use if they encountered comprehension
difficulties while processing expository texts or instructional
explanations. During the presentation, the participants were first
told that a commonly used but relatively ineffective strategy to
overcome comprehension difficulties is to reread or repeat the
respective content. Then, they were introduced to two more effective
regulation strategies: explaining content using only one’s own words
and thinking content through using one’s own examples. Both
introductions started by informing the learners about the functional
value of the respective strategy. For the first strategy, the
participants were told that explaining content in one’s own words
leads to more effective integration of the targeted learning content
with their prior knowledge because it prevents them from memorizing
things they do not understand. Then, they were briefly informed about
how to explain content using only their own words. That is, they
should only use terms or formulations that are comprehensible to both
themselves and a classmate who is not familiar with the original
learning material. After that, the learners received information on
the functional value of thinking content through using their own
examples, namely that it can lead to a deeper understanding of the
respective content and reduced forgetting rates. Next, they received
brief information on how to perform this strategy. They were told
that they should ask themselves how the respective learning content
could be applied to a new example and that they should describe their
examples in detail.

At  the  end  of  the  presentation,  the  learners  were  informed  that 
they  could  review  the  information  regarding  the  two  strategies 
whenever  they  clicked  on  the  comprehension  difficulty  button, 
which  appeared  on  each  page  of  the  subsequent  learning  environ¬ 
ment.  The  presentation  lasted  approximately  5  min. 

Multimedia presentation about e-learning. To equate the
conditions in terms of the time spent before processing the
instructional explanations, the learners in the conditions without
information about regulation strategies watched an additional
multimedia presentation after the multimedia presentation about the
dangers of making overconfident JOLs or the video about optical
illusions. Specifically, these learners were
presented a historical overview of e-learning, which started with
Ramelli’s book wheel and finished with today’s web-based trainings.
The presentation was based on material by Hefter et al.
(2015) and lasted approximately 5 min.

Computer-based  learning  environment:  Written  instructional 
explanations  related  to  the  topic  atomic  structure.  The  design  of 
the  instructional  explanations  was  in  most  respects  similar  between 
Experiments  2  and  3.  We  used  the  same  instructional  explanations 
and  engaging  prompt  and  the  experiment  was  carefully  timed  in 
coordination  with  the  participants’  chemistry  teachers.  The  only 
difference  in  Experiment  3  was  that  each  instructional  explanation 
page included the aforementioned comprehension difficulty button
in the conditions with the information about regulation strategies
(see Figure 3).

Instruments  and  Measures. 

Pretest: Assessment of prior conceptual knowledge. The pretest
was identical to the one used in Experiment 2. Two independent
raters  scored  the  written  answers  of  20  learners.  Interrater  reliability  as 
determined  by  the  intraclass  coefficient  with  measures  of  absolute 
agreement  was  very  good  for  each  of  the  four  questions  (all  ICC  > 
.85).  Thus,  only  one  rater  scored  the  rest  of  the  written  answers.  The 
raw  scores  were  summed  up  and  transformed  into  percentage  scores 
of  the  theoretical  maximum  number  of  points  that  could  have  been 
achieved  on  the  pretest  (i.e.,  15  points). 

Assessment  of  learning  processes.  Using  the  same  coding 
scheme  as  in  Experiments  1  and  2,  the  learners’  text  box  entries 
were  examined  for  monitoring  episodes,  content  repetitions,  and 
elaborations. Two independent raters who were blind to the
hypotheses and conditions coded the text box entries of 20 learners.
Interrater reliability as determined by Cohen’s κ was good (κ =
.79). In case of divergence, the coders reexamined the respective
segments  and  made  a  joint  decision.  As  interrater  reliability  was 
good,  only  one  rater  coded  the  rest  of  the  text  box  entries. 

Assessment  of  JOLs.  The  JOLs  were  assessed  the  same  way 
as  in  Experiment  2. 

Posttest:  Assessment  of  conceptual  knowledge.  The  posttest 
was  identical  to  the  one  used  in  Experiment  2.  Two  independent 
raters  who  were  blind  to  the  hypotheses  and  conditions  scored  the 
written  answers  of  20  learners.  Interrater  reliability  as  determined 
by  the  intraclass  coefficient  with  measures  of  absolute  agreement 
was  very  good  for  each  of  the  10  questions  (all  ICC  >  .85).  Thus, 
only  one  rater  scored  the  rest  of  the  written  answers.  The  raw 
scores  were  summed  up  and  transformed  into  percentage  scores  of 
the  theoretical  maximum  number  of  points  that  could  have  been 
achieved in the posttest (i.e., 40 points).

The Atomic Shell

The  Bohr  model  depicts  the  atomic  shell  in  the  following  way:  The  atomic  shell  consists  exclusively  of  electrons.  The  charge 
of  an  electron  is  exactly  the  same  as  the  charge  of  a  proton  but  is  negatively  charged.  In  a  normal  state,  the  negative  charge  of 
the  atomic  shell  matches  the  positive  charge  of  the  atomic  nucleus,  that  is,  the  number  of  positively  charged  protons  in  the 
nucleus  of  an  atom  is  equal  to  the  number  of  negatively  charged  electrons  in  the  shell. 

The electrons are not dispersed randomly around the nucleus, but are ordered in shell-like fashion around the nucleus (see
graphic below). Each of the seven possible electron shells can hold a limited number of electrons. The innermost shell, the
K-shell, can hold only two electrons. For the elements that belong to the first three periods, the second shell, the L-shell,
can hold up to eight electrons and the third shell, the M-shell, can also hold up to eight electrons.

The  shells  are  filled  moving  from  the  innermost  shell  outwards.  Each  type  of  atom  has  the  lowest  possible  number  of  shells 
that  are  needed  to  accommodate  all  of  its  electrons. 

[Graphic: Schematic representation of a shell model, with labels for an electron and the L-shell. Prompt next to the
explanation: “Write down your thoughts on the explanation.” Buttons: “Comprehension difficulty?” and “Next page.”]

Figure 3. Screenshot of an instructional explanation in Experiment 3 (translated from German). See the online
article for the color version of this figure.


Procedure. With the exception that all learners watched two
multimedia presentations/videos before entering the learning
environment, the procedure was identical to the one used in
Experiment 2. The multimedia presentation about the dangers of making
overconfident JOLs/the video on optical illusions preceded the
multimedia presentations about regulation strategies/e-learning.
The experiment lasted approximately 1.5 hr.

Results 

Table 3 shows the mean scores and standard deviations for the four groups on all measures of the study.
An α-level of .05 was used for all tests.

Preliminary analyses. Before addressing our research questions, we first tested whether the groups differed
in their pretest scores. A 2 × 2 factorial ANOVA did not reveal a statistically significant main effect of
either of the two factors, information about the dangers of making overconfident JOLs, F(1, 93) = 0.20,
p = .652, $\eta_p^2$ = .00, and information about regulation strategies, F(1, 93) = 0.29, p = .591,
$\eta_p^2$ = .00, nor did it yield a statistically significant interaction effect, F(1, 93) = 0.73,
p = .394, $\eta_p^2$ = .01. On average, the learners reached 4.7% (SD = 7.4%) of the theoretical maximum
score. This indicates that, as in Experiment 2, they had low prior conceptual knowledge. Nevertheless, to
reduce error variance, we included the pretest scores as a covariate in all subsequent analyses (with the
exception of the analyses regarding JOLs; see below). For all analyses, the assumption of homogeneous
within-group regression slopes was not violated.

Second,  as  a  type  of  implementation  check  regarding  the  factor 
information  about  the  dangers  of  making  overconfident  JOLs,  we 
analyzed  the  learners’  responses  to  the  multiple-choice  questions 


Table 3

Means (and SDs) of All Measures in Experiment 3

Variable                    Information-about-   Information-about-      Information-about-      Control group
                            overconfidence       regulation-strategies   overconfidence-and-
                            group                group                   regulation-strategies
                                                                         group

Pretest (%)                 .05 (.08)            .05 (.09)               .05 (.06)               .03 (.06)
Judgment of learning        3.13 (1.08)          3.18 (.85)              3.37 (.83)              3.09 (.89)
Posttest (%)                .36 (.25)            .30 (.18)               .47 (.21)               .31 (.21)
Monitoring episodes         1.00 (1.91)          .16 (.37)               .95 (1.67)              .64 (.95)
Repetitions                 7.44 (8.10)          9.64 (8.39)             9.64 (9.12)             5.96 (7.92)
Elaborations                .60 (1.00)           2.32 (2.43)             2.04 (2.38)             1.28 (1.51)
Learning time (in minutes)  18.33 (6.85)         21.72 (10.60)           20.62 (8.96)            19.21 (10.70)

that followed the presentation about the dangers of making overconfident JOLs. We found that 95.7% of the
responses were correct. Furthermore, only one of the participants responded incorrectly to more than one of
the questions (i.e., two questions). Jointly, these findings indicate that the learners understood the main
content of the presentation. None of the learners were excluded from the further analyses.

Monitoring and JOLs. In Research Question 1, we were interested in whether informing learners about the
dangers of making overconfident JOLs would foster their engagement in comprehension monitoring, which
should be reflected in an increase in the number of detected comprehension difficulties. The ANCOVA yielded
a statistically significant main effect of information about the dangers of making overconfident JOLs,
F(1, 92) = 4.45, p = .038, $\eta_p^2$ = .05. The learners who were informed about the dangers of making
overconfident JOLs detected more comprehension difficulties than their counterparts. There was no
statistically significant main effect of information about regulation strategies, F(1, 92) = 0.82,
p = .366, $\eta_p^2$ = .01. We also did not find a statistically significant interaction effect,
F(1, 92) = 0.52, p = .473, $\eta_p^2$ = .01.
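
The effect sizes reported alongside each F test can be cross-checked from the test statistic and its degrees of freedom, because partial eta squared equals (F × df_effect) / (F × df_effect + df_error). The following is a minimal sketch of that check and is not part of the original analysis; the function name `partial_eta_squared` is a hypothetical helper introduced here. Because the published F values are rounded, recomputed values may differ from the reported ones in the last digit.

```python
def partial_eta_squared(f_value: float, df_effect: int, df_error: int) -> float:
    """Recover partial eta squared from an F statistic and its degrees of freedom."""
    if f_value < 0 or df_effect <= 0 or df_error <= 0:
        raise ValueError("F must be non-negative and both df must be positive")
    return (f_value * df_effect) / (f_value * df_effect + df_error)


# Main effect of information about the dangers of making overconfident JOLs
# on the number of detected comprehension difficulties: F(1, 92) = 4.45.
monitoring_effect = partial_eta_squared(4.45, 1, 92)
print(f"{monitoring_effect:.2f}")  # prints 0.05
```

The recomputed value agrees with the reported effect size of .05 for this test.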

In Research Question 2, we were interested in whether the effect on the number of detected comprehension
difficulties would, given the same level of acquired conceptual knowledge, lead the informed learners to
form lower JOLs than the uninformed learners. The posttest score was included as a covariate in this
analysis; the assumption of homogeneous within-group regression slopes was not violated. The ANCOVA
revealed a statistically significant main effect of information about the dangers of making overconfident
JOLs, F(1, 92) = 4.21, p = .043, $\eta_p^2$ = .04. The informed learners showed lower JOLs
(estimated M = 3.05; SE = 0.09) than their uninformed counterparts (estimated M = 3.31; SE = 0.08).
There was no statistically significant main effect of information about regulation strategies,
F(1, 92) = 0.01, p = .914, $\eta_p^2$ = .00. Furthermore, there was no statistically significant
interaction, F(1, 92) = 0.96, p = .330, $\eta_p^2$ = .01.

In the next step, we analyzed whether this lowering of the informed learners' JOLs was mediated via the
higher number of detected comprehension difficulties. The mediation analysis showed a statistically
significant indirect effect of the number of detected comprehension difficulties, a × b = -0.072
(LCL = -0.189, UCL = -0.006). Jointly, these results indicate that the lower JOLs on the part of the
learners in the information-about-overconfidence groups were mediated through the higher number of
comprehension difficulties they had detected in the learning phase.
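
Lower and upper confidence limits for an indirect effect a × b are commonly obtained by resampling. The sketch below illustrates one such approach, a percentile bootstrap, on synthetic data. It is an assumption for illustration: this section does not state which procedure produced the reported limits, and the function names, the simulated effect sizes, and the omission of covariates are all choices made here, not features of the original analysis.

```python
import random
from statistics import fmean


def _centered_cross(u, v):
    """Sum of cross-products of two variables after centering each at its mean."""
    mu, mv = fmean(u), fmean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v))


def indirect_effect(x, m, y):
    """Return a * b: a is the slope of m on x, b the slope of y on m with x held constant."""
    sxx, smm = _centered_cross(x, x), _centered_cross(m, m)
    sxm, sxy, smy = _centered_cross(x, m), _centered_cross(x, y), _centered_cross(m, y)
    a = sxm / sxx
    b = (smy * sxx - sxy * sxm) / (smm * sxx - sxm ** 2)
    return a * b


def bootstrap_ci(x, m, y, n_boot=1000, alpha=0.05, seed=1):
    """Percentile bootstrap confidence limits for the indirect effect."""
    rng = random.Random(seed)
    n = len(x)
    estimates = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample cases with replacement
        estimates.append(
            indirect_effect([x[i] for i in idx], [m[i] for i in idx], [y[i] for i in idx])
        )
    estimates.sort()
    lower = estimates[int((alpha / 2) * n_boot)]
    upper = estimates[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Synthetic data only (not the study's data): x = condition (0/1),
# m = mediator that rises with x, y = outcome that falls as m rises.
data_rng = random.Random(0)
x = [i % 2 for i in range(200)]
m = [1.0 * xi + data_rng.gauss(0, 0.5) for xi in x]
y = [-0.5 * mi + data_rng.gauss(0, 0.5) for mi in m]

lcl, ucl = bootstrap_ci(x, m, y)
print(f"a x b = {indirect_effect(x, m, y):.3f}, LCL = {lcl:.3f}, UCL = {ucl:.3f}")
```

As in the reported result, an interval whose limits are both below zero indicates a negative indirect effect that is distinguishable from zero at the chosen level.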

Posttest scores and cognitive processes. In Research Question 3, we were interested in whether the learners
who were also informed about effective regulation strategies would benefit from being informed about the
dangers of making overconfident JOLs in terms of acquiring conceptual knowledge from the explanations.
Concerning the posttest scores, the ANCOVA did not reveal a statistically significant effect of information
about regulation strategies, F(1, 92) = 0.85, p = .358, $\eta_p^2$ = .01. By contrast, it showed a
statistically significant effect of information about the dangers of making overconfident JOLs,
F(1, 92) = 7.26, p = .008, $\eta_p^2$ = .07. This finding suggests that the learners in the groups with
information about the dangers of making overconfident JOLs outperformed their counterparts. However, the
main effect was qualified by a statistically significant interaction effect, F(1, 92) = 3.98, p = .049,
$\eta_p^2$ = .04. As depicted in Figure 4, the interaction pattern indicates that the information about the
dangers of making overconfident JOLs was particularly effective for the learners who also received
information about regulation strategies. Post hoc contrasts support this interpretation. The learners in
the information-about-overconfidence-and-regulation-strategies group outperformed the learners in the
groups without information about the dangers of making overconfident JOLs, F(1, 69) = 11.71, p = .001,
$\eta_p^2$ = .14, whereas the learners in the information-about-overconfidence group did not outperform
them, F(1, 72) = 1.05, p = .308, $\eta_p^2$ = .01. Furthermore, the learners in the
information-about-overconfidence-and-regulation-strategies group also outperformed the learners in the
information-about-overconfidence group, F(1, 44) = 4.71, p = .035, $\eta_p^2$ = .09.
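
The shape of this interaction can be illustrated with the unadjusted posttest means from Table 3 by comparing the effect of the overconfidence information at each level of the regulation-strategies factor. The sketch below is only a descriptive illustration: the reported tests are based on covariate-adjusted scores, so the raw differences computed here are not the quantities that were tested, and the dictionary keys and the helper `simple_effect` are names introduced for this example.

```python
# Unadjusted posttest means (proportion of maximum score) as listed in Table 3.
posttest_means = {
    ("overconfidence_info", "regulation_info"): 0.47,
    ("overconfidence_info", "no_regulation_info"): 0.36,
    ("no_overconfidence_info", "regulation_info"): 0.30,
    ("no_overconfidence_info", "no_regulation_info"): 0.31,
}


def simple_effect(means, regulation_level):
    """Effect of the overconfidence information at one level of the regulation factor."""
    return (means[("overconfidence_info", regulation_level)]
            - means[("no_overconfidence_info", regulation_level)])


with_regulation = simple_effect(posttest_means, "regulation_info")
without_regulation = simple_effect(posttest_means, "no_regulation_info")
interaction_contrast = with_regulation - without_regulation

print(f"with regulation info: {with_regulation:.2f}")               # prints 0.17
print(f"without regulation info: {without_regulation:.2f}")         # prints 0.05
print(f"difference of differences: {interaction_contrast:.2f}")     # prints 0.12
```

In the raw means, the advantage associated with the overconfidence information is roughly three times larger when regulation-strategy information was also provided, which matches the direction of the reported interaction.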

In the next step, we examined whether the superiority of the learners in the
information-about-overconfidence-and-regulation-strategies group was mediated (a) via a higher number of
cognitive processes and/or (b) via a change in the focus of their cognitive processes (Research Question 4).
Concerning (a), with respect to the number of content repetitions, we did not find a statistically
significant main effect of the factors information about the dangers of making overconfident JOLs,
F(1, 92) = 0.16, p = .687, $\eta_p^2$ = .00, and information about regulation strategies,
F(1, 92) = 2.84, p = .095, $\eta_p^2$ = .03, nor did we observe a statistically significant interaction,
F(1, 92) = 0.14, p = .705, $\eta_p^2$ = .00. When it came to the number of elaborations, we found a
different pattern of results. The ANCOVA did not show a statistically significant main effect of
information about the dangers of making overconfident JOLs, F(1, 92) = 1.68, p = .199, $\eta_p^2$ = .02.
By contrast, we did find a statistically significant main effect of information about regulation
strategies, F(1, 92) = 9.81, p = .002, $\eta_p^2$ = .10. The learners who received the information about
effective regulation strategies showed more elaborations than their counterparts. There was no
statistically significant interaction effect, F(1, 92) = 0.41, p = .524, $\eta_p^2$ = .00. Looking at the
total number of cognitive processes, we found a similar pattern of results. There was no statistically
significant main effect of information about the dangers of making overconfident JOLs, F(1, 92) = 0.01,
p = .919, $\eta_p^2$ = .00, whereas there was a statistically significant main effect of information
Figure 4. Interaction between information about the dangers of making overconfident judgments of learning
(JOLs) and information about regulation strategies regarding posttest scores. Circles indicate the means of
the conditions without information about the dangers of making overconfident JOLs; squares show the means
of the conditions with information about the dangers of making overconfident JOLs. Error bars represent SEMs.
about regulation strategies, F(1, 92) = 4.90, p = .029, $\eta_p^2$ = .05. The learners who were informed
about effective regulation strategies showed more cognitive processes than the uninformed learners. We did
not find a statistically significant interaction effect, F(1, 92) = 0.05, p = .828, $\eta_p^2$ = .00.

Concerning (b), neither in the information-about-overconfidence-and-regulation-strategies group nor in the
other three groups did we find a statistically significant correlation between the number of repetitions
and posttest scores, r = -.05, p = .845, and r = .11, p = .324, or between the total number of cognitive
processes and posttest scores, r = .05, p = .826, and r = .14, p = .220. However, in the
information-about-overconfidence-and-regulation-strategies group we found a statistically significant
positive correlation between the number of elaborations and the posttest scores, r = .39, p = .042
(one-sided), whereas there was no such correlation in the other three groups, r = .18, p = .123.

In light of these findings, we then analyzed whether there would be a conditional mediation effect from
informing learners about effective regulation strategies via the number of elaborations such that
generating elaborations was more beneficial for the learners who also received information about the
dangers of making overconfident JOLs. We addressed this potential mediated moderation effect by regressing
the posttest scores on the factor information about regulation strategies, the potential moderator
information about the dangers of making overconfident JOLs, the number of elaborations (the potential
mediator), and the interactions between (a) information about regulation strategies and the moderator and
(b) the mediator and the moderator in a simultaneous multiple regression model (see Muller, Judd, &
Yzerbyt, 2005; Preacher, Rucker, & Hayes, 2007). The pretest scores were included as a covariate.
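
Written out, the simultaneous regression described above takes the following form. The coefficient labels b_0 to b_6 follow Table 4; the variable symbols (Post, Pre, Reg, Over, Elab) are shorthand introduced here for illustration, with both factors contrast coded as +1 (with information) and -1 (without).

```latex
\[
\mathrm{Post}_i = b_0 + b_1\,\mathrm{Pre}_i + b_2\,\mathrm{Reg}_i + b_3\,\mathrm{Over}_i
  + b_4\,\mathrm{Elab}_i + b_5\,(\mathrm{Over}_i \times \mathrm{Elab}_i)
  + b_6\,(\mathrm{Reg}_i \times \mathrm{Over}_i) + \varepsilon_i
\]
```

In this notation, b_5 captures whether the association between elaborations and posttest scores differs between informed and uninformed learners, and b_6 captures whatever interaction between the two factors remains once the mediator terms are in the model.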

The overall regression model was statistically significant, F(6, 90) = 9.15, p < .001. The results for the
separate predictors are shown in Table 4. The statistically significant positive regression weight of the
number of elaborations (b4) reflects a beneficial partial effect of the number of elaborations on the
posttest scores. However, the statistically significant regression weight of the interaction between
information about the dangers of making overconfident JOLs (the moderator) and the number of elaborations
(b5) shows that the indirect effect of the number of elaborations on the posttest scores depended on
whether the learners were informed about the dangers of making overconfident JOLs. Specifically, the
positive regression weight shows that the effectiveness of generating elaborations was higher for the
informed group.

Table 4

Regression Results for Estimated Coefficients of the Moderated Mediation Model

Dependent variable model: Posttest scores

Predictor                                                        b        SE

Constant (b0)                                                    -.04     .02
Pretest scores (b1)                                              1.21*    .25
Information about regulation strategies (b2)
  (with: 1; without: -1)                                         -.00     .02
Information about the dangers of making overconfident
  JOLs (b3) (with: 1; without: -1)                               .03      .02
Elaborations (b4)                                                .03*     .01
Information about the dangers of making overconfident
  JOLs × Elaborations (b5)                                       .02*     .01
Information about regulation strategies × Information
  about the dangers of making overconfident JOLs (b6)            .02      .02

* p < .05.

The  model  also  shows  that  the  interaction  between  information 
about  regulation  strategies  and  information  about  the  dangers  of 
making  overconfident  JOLs  was  not  statistically  significant  (b6). 
Together  with  the  aforementioned  results,  this  finding  is  consistent 
with  mediated  moderation  (see  Muller  et  al.,  2005).  Thus,  the 
results  suggest  that  the  interaction  between  information  about  the 
dangers  of  making  overconfident  JOLs  and  information  about 
regulation  strategies  regarding  posttest  scores  was  mediated  by  (a) 
the  effect  of  information  about  regulation  strategies  on  the  number 
of  elaborations  and  (b)  the  fact  that  the  partial  effect  of  the  number 
of  elaborations  depended  on  whether  the  learners  had  received 
information  about  the  dangers  of  making  overconfident  JOLs. 
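
As a rough worked illustration of point (b), the slope of posttest scores on elaborations implied by the model is b_4 + b_5 · Over, where Over is +1 for informed and -1 for uninformed learners. The arithmetic below uses the rounded coefficients from Table 4, so the resulting values are approximate and are not reported in the original analysis.

```latex
\[
\begin{aligned}
\text{informed learners } (\mathrm{Over} = +1):\quad & b_4 + b_5 \approx .03 + .02 = .05\\
\text{uninformed learners } (\mathrm{Over} = -1):\quad & b_4 - b_5 \approx .03 - .02 = .01
\end{aligned}
\]
```

Under these rounded values, each additional elaboration is associated with a noticeably larger gain in the posttest score for informed learners than for uninformed learners.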

For exploratory purposes, we also analyzed whether the conditions differed regarding the time spent working
on the instructional explanations. The ANCOVA did not yield a statistically significant main effect for the
factors information about the dangers of making overconfident JOLs, F(1, 92) = 0.27, p = .605,
$\eta_p^2$ = .00, and information about regulation strategies, F(1, 92) = 1.53, p = .219,
$\eta_p^2$ = .02, nor did it reveal a statistically significant interaction effect, F(1, 92) = 0.00,
p = .962, $\eta_p^2$ = .00. Thus, although learning time was positively correlated with the posttest
scores, r = .28, p = .006, the superiority of the learners in the
information-about-overconfidence-and-regulation-strategies group cannot simply be attributed to more
learning time.

Discussion 

In short, the findings of Experiment 3 underpin the notion that for 13- to 15-year-old high school students,
insufficient comprehension monitoring is due in part to the fact that these learners lack conditional
knowledge concerning comprehension monitoring. Furthermore, the results suggest that the provision of this
conditional knowledge does not reduce underachievement in learners at this age because they are
insufficiently aware of effective regulation strategies.

Similar to Experiments 1 and 2, we found that the learners in the information-about-overconfidence groups
detected a higher number of comprehension difficulties while processing the instructional explanations on
the topic of atomic structure than the learners in the groups without this information. This, in turn,
brought about a decrease in JOLs (Research Questions 1 and 2). This decrease, however, was accompanied by
an increase in the conceptual knowledge the learners acquired from the explanations only in the
information-about-overconfidence-and-regulation-strategies group (Research Question 3).

A possible explanation for this pattern of results is as follows: The learners in both groups that received
the information about the dangers of making overconfident JOLs were more aware of their knowledge gaps than
the learners without this information. However, because they lacked knowledge pertaining to effective
regulation strategies for (substantial) comprehension difficulties, the learners in the
information-about-overconfidence group, similar to Experiment 2, were not able to make beneficial
regulation decisions and benefit from the detected comprehension difficulties in terms of the acquisition
of conceptual knowledge. By contrast, because they were given information about both the dangers of making
overconfident JOLs and effective regulation strategies, the learners in the
information-about-overconfidence-and-regulation-strategies group were able to effectively remedy (at least
some of) their comprehension difficulties by elaborating on learning content they identified as being not
yet well understood (Research Question 4). Consequently, they outperformed both their counterparts in the
information-about-overconfidence group and the learners in the groups without the information about the
dangers of making overconfident JOLs on the posttest.

Our mediated moderation analysis supports this interpretation. This analysis shows that the superiority of
the learners in the information-about-overconfidence-and-regulation-strategies group over the other three
groups in terms of the acquisition of conceptual knowledge arose because the generation of elaborations was
particularly effective for the learners who had been informed about the dangers of making overconfident
JOLs. An explanation for this mediated moderation effect is that the learners who were informed about the
dangers of making overconfident JOLs focused their elaborations on content they had not yet well understood
more than the learners in the groups without this information. Because of the relatively low number of
participants, however, this interpretation of the mediated moderation analysis should be treated
cautiously. We also found that the learners who did not receive information about effective regulation
strategies showed relatively few elaborations. This appears to be why the learners in the
information-about-overconfidence group did not acquire more conceptual knowledge from the explanations than
the learners who were not informed about the dangers of making overconfident JOLs.

This pattern of results contradicts the explanation put forth in Experiment 2 that the number of detected
comprehension difficulties might have been too low to yield significant regulation benefits. Although, on
average, the learners in the information-about-overconfidence-and-regulation-strategies group detected even
fewer comprehension difficulties than the learners in the information-about-overconfidence group of
Experiment 2, they benefited from this awareness of their comprehension difficulties in terms of the
acquisition of conceptual knowledge.

General  Discussion 

In summary, the present research supports the following main conclusions regarding the design of
interventions to address insufficient comprehension monitoring and the concomitant underachievement:
(a) For both university and 13- to 15-year-old high school students, insufficient comprehension monitoring
is due in part to the fact that the learners lack conditional knowledge regarding comprehension monitoring.
They do not know why they should invest significant effort in this process. Thus, interventions designed to
foster comprehension monitoring should focus not only on fostering learners' monitoring skills but also on
fostering learners' will to engage in comprehension monitoring. (b) Although the provision of conditional
knowledge regarding comprehension monitoring (in the present studies: information about the dangers of
making overconfident judgments of learning) is a promising means to foster learners' engagement in
comprehension monitoring, it does not necessarily reduce underachievement. Beneficial effects on learning
outcomes (e.g., the acquisition of conceptual knowledge) are subject to the condition that learners are
able to use the knowledge regarding their comprehension difficulties to make effective regulation
decisions. This ability may not be simultaneously present with the ability to monitor one's comprehension
and appears to be linked with the learner's age.

The multimedia presentations about the dangers of making overconfident JOLs did not include any information
on how to monitor one's own comprehension. Thus, in contrast to a large proportion of previous
interventions to foster comprehension monitoring, the presentations were not designed to foster the
learners' monitoring skills. Instead, they were explicitly designed to provide them with persuasive reasons
why increasing the effort in monitoring one's comprehension could be beneficial (i.e., conditional
knowledge). Against this background, the consistent finding of all three experiments that the informed
learners showed more monitoring episodes (i.e., detected comprehension difficulties) than their uninformed
counterparts indicates that not only insufficient monitoring skills but also insufficient will to engage in
comprehension monitoring contributes to the perennial problem of learners insufficiently monitoring their
comprehension. Because they lack conditional knowledge (e.g., Schraw, 1998; Zohar & Peled, 2008) regarding
comprehension monitoring, learners are not willing to invest significant effort in this process. As a
consequence, learners might often overlook comprehension difficulties that they could have detected on the
basis of their respective monitoring skills, which likely contributes to overly optimistic JOLs and
underachievement.

Informing learners about the dangers of making overconfident JOLs proved to be an effective remedy for both
university and 13- to 15-year-old high school students not only with respect to learners' insufficient
engagement in comprehension monitoring but also with respect to the problem of making overly optimistic
JOLs. Given the same level of acquired conceptual knowledge, we consistently found that the increase in the
number of detected comprehension difficulties caused a decrease in the informed learners' JOLs (with the
exception of one merely marginally significant effect; see Experiment 1). Thus, although our measures are
not suitable for diagnosing overconfidence per se (for an extended discussion of this limitation, see
below), it is reasonable to conclude that the informed learners at least judged their level of
comprehension more cautiously than their counterparts.

By contrast, informing learners about the dangers of making overconfident JOLs is not necessarily an
effective remedy for the problem of underachievement. The results of Experiments 2 and 3 suggest that
informing 13- to 15-year-old high school students about the dangers of making overconfident JOLs did not
lead them to make effective regulation decisions or to acquire more conceptual knowledge. This pattern of
results supports de Bruin et al.'s (2011) notion that the ability to use monitoring to effectively regulate
subsequent learning develops later than the ability to monitor one's own comprehension (see also Dunlosky &
Rawson, 2012). However, the beneficial effect of the relatively brief multimedia presentation about
effective regulation strategies (only 5 min of instruction) implemented in Experiment 3 suggests that the
13- to 15-year-old high school students' lack of effective regulation decisions was not because of
fundamental deficiencies regarding their regulation skills. Rather, similar to the insufficient level of
engagement in comprehension monitoring, this might be because of a lack of metastrategic knowledge
concerning regulation strategies. However, this explanation should be treated cautiously because the short
multimedia presentation did not exclusively include metastrategic knowledge regarding the two regulation
strategies, but also provided the learners with information on how to apply them.

Limitations  and  Future  Research 

In addition to this limitation in Experiment 3, the present experiments have some further important
limitations. First, it is worth highlighting that our JOL measures were not suitable for diagnosing
overconfidence per se. Diagnosing overconfidence would have required the learners to judge their level of
comprehension on the same scale as their posttest performance was measured (i.e., on a scale from 0 to
100%). It would also have been necessary to inform the learners about the specific posttest questions.
Otherwise, even if we had asked them to judge their level of comprehension on a scale from 0 to 100%, the
learners might still have anticipated different questions than the actual posttest questions. Consequently,
if their JOLs exceeded their posttest performance, it would have been unclear if the learners actually
overconfidently judged their level of comprehension or simply anticipated easier posttest questions (for a
related notion, see Thiede, Redford, Wiley, & Griffin, 2012). As informing the learners about the specific
posttest questions during the learning phase would have been an additional intervention that most likely
would have affected the learners' processing of the subsequent instructional explanations (for overviews of
the effects of adjunct questions, see Andre, 1979; Hamaker, 1986; McCrudden & Schraw, 2007), we decided
against trying to diagnose overconfidence in our experiments.
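
For illustration only, a judgment collected on the same 0 to 100% scale as the posttest would allow overconfidence to be quantified as the signed difference between judged and actual performance. The sketch below uses invented numbers and a hypothetical helper named `calibration_bias`; no such measure was collected in the experiments, and the caveat about anticipated versus actual posttest questions would still apply to any real use.

```python
from statistics import fmean


def calibration_bias(judgments, performances):
    """Mean signed difference between judged and actual scores (both in percent).

    Positive values indicate overconfidence, negative values underconfidence.
    """
    if len(judgments) != len(performances) or not judgments:
        raise ValueError("need two non-empty sequences of equal length")
    return fmean(j - p for j, p in zip(judgments, performances))


# Invented example values, not data from the experiments.
judged = [70.0, 55.0, 80.0]
actual = [40.0, 50.0, 45.0]
print(calibration_bias(judged, actual))
```

A positive result, as with these invented values, would indicate that judgments exceeded performance on average.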

A second limitation is that we did not assess the cues the learners utilized to monitor their comprehension
(cf. Thiede et al., 2010). We did not explicitly investigate whether the learners used relatively
predictive comprehension-based cues (e.g., their ability to generate keywords for a previously read
explanation; see Thiede et al., 2003) or cues that entailed a relatively low predictive validity (e.g.,
their ability to recall a previously read explanation; e.g., Baker & Dunlosky, 2006; Thiede et al., 2010).
Thus, even though the finding that the detected comprehension difficulties were suitable for both
university students (Experiment 1) and 13- to 15-year-old high school students (Experiment 3) for making
beneficial regulation decisions suggests that the cues the learners utilized entailed at least a certain
degree of predictive validity, we do not know the extent to which our information intervention fell on
fertile ground. Consequently, we do not know whether and, as the case may be, the extent to which the
effectiveness of informing learners about the dangers of making overconfident JOLs depends on learners' cue
utilization. To address this issue, future studies could assess learners' cue utilization (e.g., by using
a questionnaire, see Thiede et al., 2010, or standardized postreading interviews) and use it as a moderator
in subsequent analyses. Another possibility could be to factorially vary whether learners are informed
about (a) the dangers of making overconfident JOLs and (b) predictive cues.

In addition to clarifying the degree to which the effects of informing learners about the dangers of making
overconfident JOLs depend on learners' cue utilization, these studies would also shed light on whether
informing learners about the dangers of making overconfident JOLs affects this very cue utilization.
Although the present experiments clearly indicate that learners insufficiently engage in comprehension
monitoring because they lack conditional knowledge thereof, our findings do not provide insight into how
learners increase their level of engagement once they are informed about the dangers of making
overconfident JOLs. One possibility is that learners might allocate more cognitive resources to monitoring
without changing the cues they utilize. However, it is also conceivable that learners change their cue
utilization in such a way that they are less likely to utilize shallow cues and more likely to utilize
comprehension-based cues that might require a higher share of cognitive resources (and are avoided by
learners who are not aware of the dangers of making overconfident JOLs). These mechanisms might also differ
between learners of different ages. For instance, in the absence of knowledge regarding predictive cues,
high school students might allocate more resources to monitoring without switching their utilized cues. On
the other hand, university students might switch to utilizing more predictive (and cognitively more
demanding) cues.

In  addition  to  addressing  these  open  questions,  future  research 
could  also  investigate  how  the  interventions  used  in  the  present 
experiments  could  be  optimized.  For  instance,  in  the  multimedia 
presentations  that  were  designed  to  inform  learners  about  the 
dangers  of  making  overconfident  JOLs,  the  learners  watched  a
model  learner  who  read  an  expository  text  related  to  the  topics  of
clinical  psychology  (Experiment  1)  or  English  grammar  (Experiments
2  and  3).  As  neither  of  these  topics  aligned  with  the
content  of  the  subsequent  instructional  explanations  (i.e.,  atomic 
structure),  the  learners  had  to  transfer  what  they  learned  to  benefit 
from  the  information  about  the  dangers  of  making  overconfident 
JOLs  in  the  learning  phase.  This  might  have  been  a  significant 
hurdle  that  not  all  learners  could  overcome,  especially  the  13-  to 
15-year-old  high  school  students.  Therefore,  future  studies  could 
investigate  whether  the  effects  of  informing  learners  about  the 
dangers  of  making  overconfident  JOLs  would  be  even  more  ben¬ 
eficial  if  the  model  learner  is  contextualized  in  a  topic  that  more 
closely  aligns  with  the  topic  of  subsequently  provided  learning 
material. 

The  multimedia  presentation  about  effective  regulation  strate¬ 
gies  might  have  room  for  optimization  as  well.  The  design  of  the 
presentation  was  relatively  parsimonious;  the  learners  received 
information  about  the  functional  value  of  the  two  strategies  as  well 
as  few  suggestions  on  how  to  execute  them.  Although  our  results 
suggest  that  this  intervention  entailed  beneficial  effects,  it  is  rea¬ 
sonable  to  assume  that,  at  least  for  learners  who  were  not  already 
in  part  familiar  with  these  strategies,  it  would  have  been  even  more 
beneficial  if  they  had  received  more  extensive  instructional  guidance
such  as  worked  examples  (e.g.,  Hübner  et  al.,  2010;  for  a
recent  overview  of  example-based  learning,  see  Renkl,  2014) 
and/or  feedback  (e.g.,  Roelle,  Berthold,  &  Fries,  2011;  for  an 
overview  of  effects  of  feedback,  see  Hattie  &  Timperley,  2007). 
Admittedly,  all  these  hypotheses  remain  tentative  and  should  be 
addressed  by  future  studies. 

Conclusions 

Because  of  their  lack  of  metastrategic  knowledge  regarding 
comprehension  monitoring,  both  university  and  13-  to  15-year-old 
high  school  students  insufficiently  engage  in  comprehension
monitoring.  Thus,  interventions  designed  to  foster  comprehension
monitoring  should  focus  not  only  on  enhancing  learners’  monitor¬ 
ing  skills  (e.g.,  by  directing  them  toward  predictive  cues),  but  also 
on  providing  learners  with  metastrategic  knowledge  about  com¬ 
prehension  monitoring  and  enhancing  learners’  will  to  engage  in 
this  process.  A  relatively  parsimonious  and  promising  means  to  do 
so  is  to  (simply)  inform  learners  about  the  dangers  of  making 
overconfident  JOLs.  By  itself,  however,  this  intervention  may  not 
be  enough  to  reduce  underachievement,  which  often  comes  along 
with  insufficient  comprehension  monitoring,  because  some  learn¬ 
ers,  especially  at  the  high  school  level,  lack  knowledge  regarding 
effective  regulation  strategies.  For  these  learners,  this  intervention 
needs  to  be  combined  with  further  instructional  support  measures 
that  focus  on  fostering  regulation  strategies  to  be  effective. 

There  is  growing  recognition  that  cumulative  economic  risk  places  children  at  higher  risk  for  depressed
academic  competencies  (Crosnoe  &  Cooper,  2010;  NCCP,  2008;  Sameroff,  2000).  Yet,  children’s
temperamental  regulation  and  the  quality  of  the  early  childhood  classroom  environment  have  been 
associated  with  better  academic  skills.  This  study  is  an  examination  of  prekindergarten  classroom  quality 
(instructional  support,  emotional  support,  organization)  as  a  moderator  between  temperamental  regula¬ 
tion  and  early  math  and  literacy  skills  for  children  at  varying  levels  of  cumulative  economic  risk.  The 
sample  includes  children  enrolled  in  Head  Start  programs  drawn  from  the  FACES  2009  study.  Three 
main  findings  emerged.  First,  for  lower  and  highest  risk  children,  more  instructional  support  was 
associated  with  better  math  performance  when  children  had  high  levels  of  temperamental  regulation  but 
poorer  performance  when  children  had  low  temperamental  regulation.  Second,  among  highest  risk 
children,  low  instructional  support  was  protective  for  math  performance  for  children  with  low  temper¬ 
amental  regulation  and  detrimental  for  those  with  high  temperamental  regulation.  Third,  for  highest  risk 
children,  high  classroom  organization  predicted  better  literacy  scores  for  those  with  high  temperamental 
regulation.  Children  with  low  temperamental  regulation  were  expected  to  perform  about  the  same, 
regardless  of  the  level  of  classroom  organization.  Implications  are  discussed. 

This  article  was  published  Online  First  April  7,  2016.

Kathleen  Moritz  Rudasill,  Department  of  Educational  Psychology,  Uni¬
versity  of  Nebraska-Lincoln;  Leslie  R.  Hawley,  Nebraska  Center  for  Re¬
search  on  Children,  Youth,  Families,  and  Schools,  University  of  Nebraska-
Lincoln;  Jennifer  LoCasale-Crouch,  Center  for  Advanced  Study  of
Teaching  and  Learning,  University  of  Virginia;  Eric  S.  Buhs,  Department
of  Educational  Psychology,  University  of  Nebraska-Lincoln.

This  research  was  supported  by  a  grant  from  the  American  Educational
Research  Association,  which  receives  funds  for  its  “AERA  Grants  Pro¬
gram”  from  the  National  Science  Foundation  (NSF)  under  NSF  Grant
DRL-0941014.  Opinions  reflect  those  of  the  authors  and  do  not  necessarily
reflect  those  of  the  granting  agencies.

Correspondence  concerning  this  article  should  be  addressed  to  Kathleen
Moritz  Rudasill,  Department  of  Educational  Psychology,  University  of
Nebraska-Lincoln,  221  Teachers  College  Hall,  P.O.  Box  880345,  Lincoln,  NE
68588-0345.  E-mail:  krudasill2@unl.edu

Optimizing  children’s  learning  capacity  is  among  the  foremost
social  and  political  goals  of  the  current  national  agenda.  Early
childhood  has  been  identified  as  a  sensitive  period  encompassing
experiences  that  affect  the  development  of  salient  socioemotional
and  cognitive  competencies  that  are  linked  to  long-term  adjust¬
ment  outcomes  (e.g.,  Entwisle,  1995).  A  well-established  body  of
research  points  to  the  critical  role  of  poverty  in  the  development  of
academic  competencies  in  early  childhood  and  beyond  (Duncan  et
al.,  2007;  Raver,  Gershoff,  &  Aber,  2007);  there  is  also  growing
recognition  that  cumulative  economic  risk  (i.e.,  having  multiple 
indicators  of  economic  risk)  places  children  at  greater  risk  for  poor 
academic  competencies  (Appleyard,  Egeland,  van  Dulmen,  & 
Sroufe,  2005;  Crosnoe  &  Cooper,  2010;  Sameroff,  2000).  Chil¬ 
dren’s  temperamental  regulation  (i.e.,  the  temperament-based  abil¬ 
ity  to  attend  to  appropriate  stimuli  and  inhibit  inappropriate  be¬ 
haviors)  has  also  been  linked  to  academic  success,  with  low  levels 
of  temperamental  regulation  associated  with  poorer  academic 
competence  (Martin,  Drew,  Gaddis,  &  Moseley,  1988;  Rudasill, 
Gallagher,  &  White,  2010).  However,  evidence  suggests  that  high- 
quality  early  childhood  classrooms  can  be  protective  for  outcomes 
of  children  with  economic  (Crosnoe  &  Cooper,  2010;  Hamre  & 
Pianta,  2005)  or  temperamental  risk  (e.g.,  poor  attention,  Rudasill 
et  al.,  2010;  over-  or  undercontrolled,  Vitiello,  Moas,  Henderson, 
Greenfield,  &  Munis,  2012).  Yet,  there  have  been  virtually  no 
investigations  of  how  children’s  temperamental  regulation  may 
compound  (at  low  levels)  or  mitigate  (at  high  levels)  cumulative 
economic  risk,  and  the  potential  protective  role  of  high-quality 
classrooms  across  varying  levels  of  cumulative  economic  risk. 

This  study  is  an  examination  of  temperamental  regulation,  class¬ 
room  quality,  and  cumulative  economic  risk  as  predictors  of  chil¬ 
dren’s  academic  skills  at  the  end  of  prekindergarten  in  a  sample  of 
children  who  are  enrolled  in  Head  Start  programs.  Children  served 
by  Head  Start  programs  are  typically  economically  disadvantaged 
(at  least  90%  of  families  enrolled  in  Head  Start  must  be  below  the
poverty  line),  and  many  have  multiple  economic  risk  factors, 
including  low  levels  of  parental  education  and  single-parent  house¬ 
holds.  Although  low  income  is  often  used  as  an  indicator  of 
economic  risk,  evidence  suggests  that  a  combination  of  risk  factors 
(e.g.,  low  income  and  low  parental  education)  is  most  detrimental 
for  children’s  academic  outcomes  (Crosnoe  &  Cooper,  2010).  A 
novel  contribution  of  this  study  is  the  examination  of  varying 
levels  of  cumulative  economic  risk  among  an  already  at-risk 
population  of  children,  thus,  enabling  investigation  of  tempera¬ 
mental  regulation  and  classroom  quality  for  children  at  the  highest 
levels  of  risk. 

Theoretical  Framework 

This  study  is  grounded  in  bioecological  theory  (Bronfenbrenner 
&  Morris,  1998),  positing  that  development  occurs  through  bidi¬ 
rectional  interactions  (i.e.,  proximal  processes)  between  an  indi¬ 
vidual  and  people,  objects,  and  symbols  in  the  environment.  Prox¬ 
imal  processes  are  influenced  by  the  characteristics  of  the 
individual  and  his  or  her  environment.  From  this  perspective,  we 
use  a  Child  ×  Environment  model  (Ladd,  2003)  to  conceptualize
complex  interactions  between  characteristics  of  children  (temper¬ 
amental  regulation  and  cumulative  economic  risk)  and  preschool 
classroom  quality  (instructional  support,  emotional  support,  and 
organization)  as  predictors  of  children’s  academic  skill  develop¬ 
ment  in  prekindergarten.  Higher  classroom  quality  is  conceptual¬ 
ized  as  a  potential  buffer  for  children  with  poor  temperamental 
regulation,  high  cumulative  economic  risk,  or  both.  This  complex 
set  of  interactions  serves  as  a  potentially  powerful  example  for 
examining  the  assertions  of  Child  ×  Environment  models.

Although  evidence  suggests  classroom  quality  may  mitigate 
negative  effects  of  demographic  or  temperamental  risk  (Curby  et 
al.,  2011;  Hamre  &  Pianta,  2005),  there  is  no  research  that  examines
classroom  quality  as  a  moderator  of  the  association  between
temperamental  regulation  and  academic  performance  for  children 
at  varying  levels  of  higher  cumulative  economic  risk.  It  may  be 
that  low  temperamental  regulation  and  high  cumulative  economic 
risk  are,  in  combination,  too  maladaptive  to  be  mitigated  by 
high-quality  classrooms.  For  example,  Crosnoe  et  al.  (2010)  found 
that,  for  children  from  low-income  families  with  less  stimulating 
home  environments,  being  in  stimulating  preschool  and  first-grade 
environments  was  not  sufficient  to  compensate  for  deficits  in 
reading  and  math  skill  development  relative  to  skills  of  higher 
income  peers  or  lower  income  peers  from  more  stimulating  home 
environments.  Thus,  although  we  expect  better  math  and  literacy 
performance  for  children  with  higher  levels  of  temperamental 
regulation  who  are  in  higher  quality  preschool  classrooms,  hypoth¬ 
eses  regarding  the  extent  to  which  this  may  vary  in  interaction  with 
different  levels  of  cumulative  economic  risk  in  family  contexts 
have  not  been  empirically  examined. 

Cumulative  Economic  Risk 

Children  from  low-income  families  typically  begin  kindergarten 
with  far  fewer  skills  than  their  higher  income  classmates  (U.S. 
Department  of  Health  &  Human  Services,  2010)  and,  without 
intervention,  they  fall  even  further  behind  as  they  move  through 
elementary  grades  (McClelland  et  al.,  2006).  Though  higher  economic
risk  has  consistently  predicted  poorer  outcomes  for  children,
growing  evidence  suggests  that  assessing  the  accumulation
of  risk  factors,  rather  than  the  presence  or  absence  of  a  specific  risk 
factor,  provides  a  more  accurate  estimate  of  the  potential  impact  on 
development  (Appleyard  et  al.,  2005;  Crosnoe  &  Cooper,  2010; 
Sameroff,  2000).  Sameroff,  Bartko,  Baldwin,  Baldwin,  and  Seifer 
(1998)  found  that  the  level  of  cumulative  risk  experienced  in 
preschool,  for  example,  was  associated  both  with  concurrent  be¬ 
havior  problems  in  preschool  as  well  as  downstream  mental  health, 
problem  behavior,  and  academic  problems  in  adolescence.  Ac¬ 
cording  to  Appleyard  et  al.  (2005),  the  timing  of  risk  is  particularly 
important,  with  the  number  of  risks  in  early  childhood  indepen¬ 
dently  predicting  an  increase  in  behavior  problems  over  and  above 
indices  of  adolescent  risk.  On  the  basis  of  this  evidence,  we  used 
a  similar  strategy  in  the  current  study. 
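
The counting approach described above can be illustrated with a minimal sketch. It is not the authors' actual scoring procedure: the three binary indicators are taken from the examples of economic risk named in the introduction (income below the poverty line, low parental education, single-parent household), and the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class FamilyIndicators:
    """Hypothetical binary economic risk indicators for one child's family."""
    below_poverty_line: bool
    low_parental_education: bool
    single_parent_household: bool


def cumulative_risk(family: FamilyIndicators) -> int:
    """Return the number of risk indicators that are present (0 to 3)."""
    return sum([
        family.below_poverty_line,
        family.low_parental_education,
        family.single_parent_household,
    ])


# Example: two of the three indicators are present for this family.
example_family = FamilyIndicators(
    below_poverty_line=True,
    low_parental_education=False,
    single_parent_household=True,
)
example_score = cumulative_risk(example_family)
```

The point of the index is that two families who share any single indicator can still differ in their total, which is the variation a cumulative measure is meant to capture.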

Crosnoe  and  Cooper  (2010)  showed  that  classroom-level  factors 
moderated  the  association  between  child  cumulative  risk  status  and 
achievement  in  elementary  school.  This  suggests  that  even  for 
groups  of  children  assumed  to  experience  similar  levels  of  eco¬ 
nomic  risk  (e.g.,  Head  Start  participants)  variation  in  children’s 
cumulative  economic  risk  may  interact  with  potential  protective 
factors  to  impact  children’s  outcomes.  Thus,  failure  to  examine 
variations  in  outcomes  as  a  consequence  of  cumulative  risks  may 
mask  some  profound  differences. 

Child  Temperament 

Temperament  is  an  individual’s  general  style  of  responding  to 
the  environment  and  refers  to  innate  individual  differences  in 
reactivity  and  regulation  (Rothbart  et  al.,  2000).  Temperament  is 
“the  relatively  enduring  biological  makeup  of  the  [individual], 
influenced  over  time  by  heredity,  maturation,  and  experience” 
(Rothbart  &  Derryberry,  1981,  p.  40).  It  is  widely  acknowledged  to 
be  relatively  stable  through  early  elementary  school  and  beyond 
(e.g.,  Caspi  &  Silva,  1995;  Rothbart  &  Posner,  2005)  and  to  result 
from  and  influence  interactions  between  genes  and  the  environ¬ 
ment  (Shiner  et  al.,  2012).  Although  temperamental  reactivity 
refers  to  one’s  initial  emotional  and  behavioral  reaction  to  stimuli 
in  the  environment,  temperamental  regulation  operates  on  reactiv¬ 
ity  (Rothbart  &  Posner,  2005);  it  is  the  temperamentally  based 
ability  to  regulate  an  initial  emotional  and  behavioral  reaction.  For 
example,  a  child’s  initial  response  (reactivity)  in  a  dispute  with  a 
peer  may  be  to  cry  or  lash  out;  temperamental  regulation  may, 
however,  help  the  child  curb  the  initial  reaction  and  enact  a  more 
moderate,  socially  acceptable  response.  Because  of  the  extensive 
literature  linking  temperamental  regulation  and  related  constructs 
(e.g.,  executive  function)  to  children’s  academic  readiness  and 
success  (e.g.,  Blair,  2002),  temperamental  regulation  is  the  focus 
of  this  study. 

Temperamental  Regulation 

Children’s  temperamental  regulation  begins  to  develop  in  the 
second  or  third  year  of  life,  with  the  greatest  development  occur¬ 
ring  during  the  preschool  and  early  elementary  years  (Rothbart  & 
Bates,  2006).  Temperamental  regulation  includes  the  abilities  to 
attend  to  stimuli,  inhibit  inappropriate  responses,  and  align  behav¬ 
ior  with  context-specific  expectations  (Rothbart  &  Bates,  2006).  It 
is  similar  to  executive  function  (EF),  which  also  indicates  regulated
behavior  and  incorporates  attention  and  inhibitory  control.
However,  EF  emphasizes  the  cognitive  basis  of  regulation,  and 
includes  components  beyond  those  in  temperamental  regulation 
such  as  working  memory  and  planning  (Liew,  2011;  Wolfe  &  Bell, 
2003). 

High  levels  of  temperamental  regulation  appear  to  pay  particu¬ 
larly  big  dividends  in  the  school  setting.  Not  only  do  more  regu¬ 
lated  children  have  an  easier  time  with  basic  behavioral  expecta¬ 
tions  in  the  classroom  (e.g.,  sitting  still,  taking  turns,  remaining 
quiet,  and  following  directions;  Blair,  2002;  Rimm-Kaufman, 
Curby,  Grimm,  Nathanson,  &  Brock,  2009)  but  also  they  are 
advantaged  when  it  comes  to  benefiting  from  instructional  activ¬ 
ities.  They  may  be  more  likely  to  persist  with  difficult  tasks,  work 
efficiently  in  loud  or  distracting  environments,  and  pay  attention  to 
teacher  instruction — skills  associated  with  better  academic  out¬ 
comes  (Bierman,  Nix,  Greenberg,  Blair,  &  Domitrovich,  2008; 
McClelland  et  al.,  2007).  In  an  examination  of  links  between 
children’s  regulatory  and  academic  skills  prior  to  kindergarten, 
McClelland  and  colleagues  (2007)  found  that  children’s  growth  in 
regulation  was  positively  associated  with  their  growth  in  emergent 
literacy,  vocabulary,  and  math  skills  across  the  prekindergarten 
year.  Relatedly,  Bierman  et  al.  (2008)  found  robust  associations 
between  Head  Start  children’s  regulation  (operationalized  as  EF) 
in  the  fall  of  prekindergarten  and  their  language,  emergent  literacy, 
and  behavioral  outcomes  at  the  end  of  the  year.  Together,  these 
findings  highlight  the  importance  of  further  inquiry  into  the  role  of 
regulation  in  young  children’s  academic  lives,  particularly  into  the 
factors  that  may  ameliorate  or  exacerbate  poor  regulation  skills. 

Classroom  Quality 

Observations  of  classroom  instructional  support,  emotional  sup¬ 
port,  and  organization  have  been  used  to  assess  the  nature  of 
interactions  (proximal  processes)  between  teachers  and  children  in 
classrooms  (i.e.,  classroom  quality).  Mounting  evidence  points  to 
the  importance  of  high-quality  preschool  classrooms  for  promoting 
children’s  early  academic  skills  and  readiness  for  kindergarten 
(e.g.,  Burchinal,  Howes  et  al.,  2008;  Vitiello  et  al.,  2012),  as  well 
as  the  capacity  for  high  quality  classrooms  to  buffer  children 
at-risk  for  academic  difficulties  due  to  low  socioeconomic  status 
(SES;  e.g.,  Burchinal,  Howes  et  al.,  2008;  Maier,  Vitiello,  & 
Greenfield,  2012).  The  Classroom  Assessment  Scoring  System 
(CLASS;  Pianta  et  al.,  2008)  is  an  observational  tool  designed  to 
capture  these  classroom  processes,  and  indicators  of  instructional 
support,  emotional  support,  and  classroom  organization  from  this 
assessment  have  been  consistently  predictive  of  children’s  aca¬ 
demic  and  behavioral  outcomes  in  preschool  and  early  elementary 
grades  (e.g.,  Mashburn  et  al.,  2008;  Pianta,  Belsky,  Vandergrift, 
Houts,  &  Morrison,  2008). 

Instructional  support.  Instructional  support  refers  to  teach¬ 
ers’  interactions  with  children  that  promote  concept  and  skill 
development  through  scaffolding,  questioning,  and  feedback  be¬ 
tween  teachers  and  children.  Growing  evidence  suggests  that  better 
instructional  support  promotes  greater  achievement.  In  a  large 
study  of  academic  and  social  skills  among  children  in  public 
prekindergarten  programs,  Mashburn  et  al.  (2008)  found  that  class¬ 
room  instructional  support  positively  predicted  children’s  literacy, 
language,  and  math  scores  above  and  beyond  a  host  of  child  and 
teacher  characteristics. 


Emotional  support.  Emotional  support  in  the  preschool 
classroom  is  also  important  for  children’s  early  school  success, 
particularly  for  children  at  risk  for  academic  difficulty  (O’Connor
&  McCartney,  2007).  Teachers  who  provide  child-centered  class¬ 
room  environments,  marked  by  positive  climate,  warmth,  and 
teacher  sensitivity  (La  Paro  et  al.,  2004)  are  likely  to  have  pupils 
who  thrive  academically  (e.g.,  O’Connor,  Cappella,  McCormick, 
&  McClowry,  2014). 

Classroom  organization.  Given  the  growing  body  of  work 
indicating  the  importance  of  children’s  early  self-regulatory  and 
executive  functioning  skills,  recent  attention  has  focused  on  the 
organizational  and  managerial  aspects  of  the  classroom  that  appear 
to  be  most  important  for  the  development  of  these  skills  (Ursache, 
Blair,  &  Raver,  2012).  Young  children  demonstrate  better  behav¬ 
ioral  and  cognitive  self-control  in  the  classroom  and  less  time 
off-task  when  teachers  proactively  manage  their  behavior  and 
attention  (Rimm-Kaufman  et  al.,  2009).  In  addition,  evidence 
suggests  that  well-organized  and  managed  classrooms  benefit  chil¬ 
dren’s  cognitive  and  academic  development  (Downer,  Sabol,  & 
Hamre,  2010). 

Classroom  Quality  as  a  Moderator  Between  Child 
Characteristics  and  Outcomes 

Accumulating  evidence  on  the  value  of  high-quality  classrooms 
suggests  that  such  classrooms  are  likely  to  be  protective  for  chil¬ 
dren  at  risk  for  academic  and/or  behavioral  problems.  In  a  seminal 
study,  Hamre  and  Pianta  (2005)  tested  instructional  support  and 
emotional  support  in  first-grade  classrooms  as  moderators  between 
children’s  risk  in  kindergarten  (both  demographic  and  functional 
risk)  and  their  academic  achievement  and  relationship  quality  with 
teachers  in  first  grade.  They  found  that  demographic  risk  (indi¬ 
cated  by  lower  maternal  education  level)  was  moderated  by  in¬ 
structional  support  for  predicting  children’s  achievement  (aggre¬ 
gated  across  scores  on  all  Woodcock-Johnson  subtests  of  cognition 
and  achievement),  such  that  children  in  high  instructional  support 
first-grade  classrooms  with  less  educated  mothers  performed  just 
as  well  as  their  peers  with  more  educated  mothers;  but  in  low 
instructional  support  classrooms,  children  with  less  educated 
mothers  were  substantially  outperformed  by  their  peers  with  more 
educated  mothers.  Classroom  emotional  support  emerged  as  a 
similar  mechanism  for  children’s  functional  risk  (the  presence  or 
absence  of  two  or  more  risk  factors  drawn  from  sustained  atten¬ 
tion,  externalizing  behavior,  and  social  and  academic  competence) 
and  achievement  and  teacher-child  conflict.  Rudasill  et  al.  (2010) 
extended  this  work  and  examined  lower  levels  of  temperamental 
attention  in  preschool  as  a  potential  risk  factor  for  later  achieve¬ 
ment.  Emotional  support  in  third-grade  classrooms  was  construed 
as  a  moderator  of  the  link  between  earlier  attention  and  later 
reading  and  math  achievement.  They  found  effects  consistent  with 
findings  from  Hamre  and  Pianta  (2005);  high  emotional  support
was  protective.  That  is,  in  high  emotional  support  classrooms, 
children  with  low  attention  performed  as  well  as  their  peers  with 
high  attention,  but  in  low  emotional  support  classrooms,  better 
attention  was  associated  with  better  performance  on  math  and 
reading  assessments. 
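
The protective pattern described in these studies corresponds to an interaction term in a regression model. The sketch below is illustrative only: the coefficients are invented, attention and emotional support are coded 0 (low) and 1 (high) for simplicity, and the function names are hypothetical. It shows how a negative interaction coefficient makes the effect of attention shrink to zero in high-support classrooms.

```python
def predicted_score(b0, b_attention, b_support, b_interaction,
                    attention, support):
    """Predicted achievement from a linear model with an interaction term."""
    return (b0
            + b_attention * attention
            + b_support * support
            + b_interaction * attention * support)


def simple_slope(b_attention, b_interaction, support):
    """Effect of attention on achievement at a given level of support."""
    return b_attention + b_interaction * support


# Invented coefficients chosen to produce a protective (buffering) pattern.
B0, B_ATTENTION, B_SUPPORT, B_INTERACTION = 90.0, 6.0, 6.0, -6.0

# Low-support classrooms: attention matters (slope of 6 points).
slope_low_support = simple_slope(B_ATTENTION, B_INTERACTION, support=0.0)

# High-support classrooms: attention no longer matters (slope of 0 points).
slope_high_support = simple_slope(B_ATTENTION, B_INTERACTION, support=1.0)
```

With these values, a low-attention child in a high-support classroom has the same predicted score as a high-attention peer in that classroom, whereas in a low-support classroom the high-attention child scores higher.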

Recent  research  has  provided  some  additional  evidence  of  the 
protective  capacity  of  classroom  quality  among  low-SES  pre¬ 
school  children.  Maier  et  al.  (2012)  explored  Head  Start  children’s 
teacher-rated  psychosocial  strengths  (i.e.,  initiative,  self-control,
and  attachment)  as  predictors  of  language  and  literacy  skills  and 
the  extent  to  which  psychosocial  strengths  were  moderated  by 
preschool  classroom  quality.  Results  indicated  that  psychosocial 
strengths  and  classroom  organization  exerted  main  effects  on 
children’s  language  and  literacy  skills,  but  there  was  no  evidence
of  interactions  between  classroom  quality  and  psychosocial 
strengths.  In  another  study  of  Head  Start  children,  Dominguez, 
Vitiello,  Fuccillo,  Greenfield,  and  Bulotsky-Shearer  (2011)  examined
classroom  quality  as  a  moderator  between  children’s
classroom-based  problem  behaviors  and  their  approaches  to  learn¬ 
ing  (ATL).  High  emotional  support  classrooms  were  protective;  in 
those  classrooms,  children  with  problems  interacting  with  teachers 
and  problems  in  instructional  activities  demonstrated  the  same 
proficiency  with  ATL  as  their  counterparts  with  few  problems. 
Unexpectedly,  in  high  instructional  support  classrooms,  children 
with  more  problems  interacting  with  teachers  demonstrated  lower 
levels  of  ATL  than  their  peers  with  few  problems;  but  in  low 
instructional  support  classrooms,  children  with  more  problems 
demonstrated  the  same  ATL  as  children  with  few  problems.  Thus, 
low  instructional  support  appeared  to  be  protective  for  children 
with  more  problematic  interactions  with  teachers. 

Vitiello  et  al.  (2012)  conducted  a  study  of  Head  Start  children 
that  evaluated  classroom  quality  as  a  potential  moderator  between 
children’s  temperament  and  language,  literacy,  and  math  gains 
across  the  prekindergarten  year.  Head  Start  teachers  rated  chil¬ 
dren’s  temperament  as  overcontrolled  (e.g.,  shy,  fearful,  having 
intense  reactions),  resilient  (e.g.,  well  regulated,  positive,  friendly), 
or  undercontrolled  (e.g.,  low  in  regulation,  impulsive,  high  in 
activity)  at  the  beginning  of  the  school  year.  Teachers  also  rated 
children’s  language,  literacy,  and  math  skills  at  multiple  points 
during  the  school  year.  Classroom  quality  was  assessed  through 
observations  during  the  spring  of  the  prekindergarten  year.  Results 
indicated  that  high  emotional  support  was  protective  for  resilient 
children’s  increases  in  language  and  literacy  skills;  that  is,  resilient 
children  showed  more  improvement  as  emotional  support  in¬ 
creased.  However,  overcontrolled  children’s  increases  across  the 
year  did  not  vary  as  a  function  of  classroom  emotional  support.  In 
addition,  high  instructional  support  was  protective  for  overcon¬ 
trolled  children’s  increases  in  language  and  literacy  skills,  but  low 
instructional  support  was  protective  for  resilient  children’s  im¬ 
provement.  That  is,  as  instructional  support  increased,  overcon¬ 
trolled  children  had  more  growth  in  language  and  literacy  across 
the  school  year;  however,  resilient  children  had  more  growth  in 
language  and  literacy  as  instructional  support  decreased.  Collec¬ 
tively,  findings  from  these  studies  suggest  that  classroom  quality 
may  not  consistently  ameliorate  children’s  risk;  different  aspects 
of  classroom  quality  may  be  sources  of  support  or  stress  when 
examined  in  interaction  with  child  characteristics  such  as  temper¬ 
amental  regulation  or  economic  risk. 

The  Present  Study 

This  study  is  informed  by  evidence  from  three  bodies  of  liter¬ 
ature.  The  first  is  a  group  of  high-quality  studies  showing  aca¬ 
demic  decrements  and  disadvantages  for  children  from  homes  with 
greater  cumulative  economic  risk,  particularly  in  early  childhood. 
The  second  is  evidence  supporting  the  notion  that  children’s  academic
achievement  is  predicated  in  part  on  children’s  temperamental
regulation.  The  third  shows  that  classroom  quality  is  related
to  children’s  outcomes.  Emerging  work  combining  these  litera¬ 
tures  suggests  that  classroom  quality  may  be  protective  for  chil¬ 
dren  at  risk  for  academic  difficulty.  At  the  same  time,  there  is  some 
evidence  that  classroom  quality  does  not  work  in  the  same  way  for 
all  children  (e.g.,  Dominguez  et  al.,  2011).  Some  of  these  studies 
have  used  Head  Start  samples  (Dominguez  et  al.,  2011;  Maier  et 
al.,  2012;  Vitiello  et  al.,  2012)  or  used  tests  of  demographic  or 
temperamental  risk  (Curby  et  al.,  2011;  Hamre  &  Pianta,  2005; 
Rudasill  et  al.,  2010).  However,  none  has  included  examinations  of 
cumulative  economic  risk  as  a  moderator  of  associations  between 
temperament,  classroom  quality,  and  academic  skills  in  the  pre¬ 
kindergarten  year.  The  current  study  is  critical  because  the  inter¬ 
play  of  temperament  and  classroom  quality  factors  on  children’s 
academic  achievement  remains  understudied,  particularly  for  chil¬ 
dren  at  varying  levels  of  cumulative  economic  risk. 

Research  Questions  and  Hypotheses 

We  used  data  drawn  from  the  Head  Start  Family  and  Child 
Experiences  Survey  (FACES)  2009,  a  large,  longitudinal  study 
following  children  from  their  entrance  into  Head  Start  in  the  fall  of 
2009  (ages  3  and  4  years)  to  the  end  of  kindergarten.  With  these 
data,  we  examined  whether  prekindergarten  classroom  quality 
moderated  associations  between  temperament  and  literacy/math 
skills  differently  for  children  at  varying  levels  of  cumulative 
economic  risk.  We  expected  that  classroom  quality  would  have  a 
buffering  effect  for  children  with  low  temperamental  regulation  or 
high  cumulative  economic  risk,  or  both,  such  that  children  in 
higher  quality  classrooms  would  display  better  math  and  literacy 
skills  at  the  end  of  prekindergarten  than  their  peers  in  lower  quality 
classrooms.  We  also  hypothesized  the  following:  (a)  children  with 
higher  temperamental  regulation  would  display  better  math  and 
literacy  skills  at  the  end  of  prekindergarten;  (b)  children  with  lower 
cumulative  economic  risk  would  display  better  math  and  literacy 
skills  at  the  end  of  prekindergarten;  (c)  children  in  preschool 
classrooms  where  teachers  provide  higher  levels  of  instructional 
support,  emotional  support,  and  organization  would  have  better 
math  and  literacy  skills  at  the  end  of  prekindergarten;  (d)  children 
in high-quality preschool classrooms would perform well in
assessments of math and literacy skills, regardless of individual
differences  in  temperamental  regulation  or  cumulative  economic 
risk.  Conversely,  children  in  low-quality  preschool  classrooms 
with  poor  temperamental  regulation  or  high  cumulative  economic 
risk  would  be  outperformed  by  their  more  regulated  or  lower  risk 
peers. 

Method 

Sample  and  Participants 

Sampling design. The FACES 2009 study used a complex
multistage sampling design to create a nationally representative
sample of 3- and 4-year-old children participating in Head Start.
Data were collected from 3- and 4-year-old cohorts starting in fall
2009 through either spring 2011 (4-year-old cohort) or 2012
(3-year-old cohort). Data were obtained using a four-stage sampling
approach that included: (a) Head Start programs (60); (b) centers
within Head Start programs (two per program); (c) classrooms
within centers (up to three per center); and (d) children within
classrooms (10 per classroom; U.S. Department of Health and
Human Services, 2013). This sequential sampling method was
based on the Chromy procedure, which uses a combination of
probability proportional to size (PPS) and implicit and explicit
stratified sampling to select the programs, centers, and classrooms
(U.S. Department of Health and Human Services, 2013). Explicit
strata included census region, urbanicity, and racial/ethnic
minority enrollment. Implicit strata included percentage in program
whose primary home language is English, percentage of children
with disabilities, and percentage of dual-language learners. PPS
was used in the first three stages with selection of programs,
centers, and classrooms. The fourth and final stage of this
approach sampled equal numbers of children within classrooms. The
complete dataset for the 4-year-old cohort in fall 2009 included 12
first-stage strata, 60 programs, 365 classrooms, and an average of
4 students per classroom. For more information regarding the
sampling procedures, see the FACES user’s guide (U.S. Department
of Health and Human Services, 2013).

Because FACES data were drawn from a multistage sample
rather than a traditional simple random sampling design, specific
analytical procedures were used to account for the complex
sampling to avoid biased parameter estimates and incorrect standard
errors (Hahs-Vaughn, McWayne, Bulotsky-Shearer, Wen, & Faria,
2011; Kaplan & Ferguson, 1999). Model- and design-based
approaches are the two general methods for handling data obtained
from multistage complex sampling (Muthén & Satorra, 1995). To
determine the best approach for our analyses, we evaluated the
consistency of parameter estimates from both approaches. Design-
and model-based analyses were conducted using the software
program SAS 9.3 (SAS Institute, 2011). The model-based
approach (HLM) was estimated using the MIXED procedure, and the
design-based approach was estimated using the SURVEYREG
procedure. The model-based approach applies a three-level (L1:
child; L2: classroom; L3: center) model with sampling weights.
The design-based approach estimates a single-level model and uses
the sampling weights as well as the cluster and stratum design
variables. Results from both approaches were consistent in terms
of the fixed effect estimates. There were slight differences in the
variances across the two approaches, but there was no impact on
our conclusions. Further, although a model-based approach in
MIXED using maximum likelihood (ML) estimation methods is
sometimes preferred when there are missing data, this was not the
case with our analyses because the effective sample size using ML
methods through the MIXED procedure and listwise deletion
through the SURVEYREG procedure was the same (Allison,
2012). On the basis of the consistency in our conclusions across
the two approaches and the recommendations in the FACES user’s
guide (U.S. Department of Health and Human Services, 2013), we
determined the most appropriate method for accounting for the
complex sampling in this study was to use a design-based
approach via the SURVEYREG procedure in SAS.

In our analysis, we specified the stratum (Strata = STRAT) and
primary sampling unit (cluster = PSU; Head Start programs)
variables to adjust the standard errors to account for the multistage
sampling (i.e., clustering). Strata act as independent populations
from which Head Start programs or PSUs were sampled
(Hahs-Vaughn et al., 2011). Using the SURVEYREG procedure, we
applied the child longitudinal weight variable (weight = P12WT)
to adjust parameter estimates for differential probabilities of
selection and response in our child-level outcomes. Multiple weights
are available within the FACES data set, with the choice of weight
depending on the specific variables included in the analysis.
According to the FACES user guide, sampling weights were created
using a three-step process. First, a probability of selection was
calculated at each stage of sampling (program, center, classroom,
child) and within each explicit sampling stratum. Next, the inverse
of the probability of selection was calculated at each stage and
stratum. The inverse of the probability of selection is called the
sampling weight. The sampling weight is used in the final step,
where at each stage the sampling weight is multiplied by the
inverse of the weighted response rate. It is assumed that eligibility
status of each sampled unit is known at each stage, so this process
adjusts weights to account for respondents and nonrespondents
(U.S. Department of Health and Human Services, 2013).
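
The three-step weighting logic described above can be illustrated with a
minimal sketch. It is not the FACES weighting code: the function names,
the stage probabilities, and the response rates below are hypothetical
values chosen only to show the arithmetic (inverse probability of
selection at each stage, multiplied by the inverse of the response rate
at that stage).

```python
from math import prod


def base_weight(selection_probs):
    """Product of the inverse selection probabilities across stages."""
    return prod(1.0 / p for p in selection_probs)


def adjusted_weight(selection_probs, response_rates):
    """Multiply each stage's sampling weight by the inverse response rate."""
    if len(selection_probs) != len(response_rates):
        raise ValueError("one response rate is needed per sampling stage")
    return prod((1.0 / p) * (1.0 / r)
                for p, r in zip(selection_probs, response_rates))


# Hypothetical values for the program, center, classroom, and child stages.
example_probs = [0.05, 0.5, 0.75, 0.5]
example_rates = [1.0, 1.0, 0.9, 0.8]

if __name__ == "__main__":
    print(base_weight(example_probs))
    print(adjusted_weight(example_probs, example_rates))
```

In this sketch a child with a lower chance of selection, or from a stage
with more nonresponse, receives a larger weight and so stands in for more
children in the population.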

Empirical  sample.  The  current  study  included  only  the 
4-year-old  cohort  because  data  were  collected  from  this  cohort  in 
both  the  fall  (2009)  and  spring  (2010)  of  the  prekindergarten  year. 
Additionally, we purposely selected only children in the 4-year-old
cohort who remained in the same classroom from fall to spring.
Table  1  contains  the  weighted  and  unweighted  sample  statistics. 


Table 1
Descriptive Statistics

                                     Weighted               Unweighted
Variable                       n       M       SE      n       M       SD
T1 Age                        785    52.37    0.23    937    52.29    3.83
T1 Applied Problems score     693   383.71    1.29    832   382.72   23.81
T2 Applied Problems score     741   396.6     1.20    878   395.5    23.89
T1 Letter-Word score          718   310.75    1.94    860   310.58   25.46
T2 Letter-Word score          752   331.31    1.65    890   330.81   27.31
T2 Temperament                759    -0.13    0.23    899    -0.03    2.40
T2 Classroom Organization     709     4.68    0.05    851     4.72    0.61
T2 Instructional Support      709     2.26    0.05    851     2.27    0.60
T2 Emotional Support          709     5.26    0.06    851     5.29    0.51

                                     Weighted               Unweighted
Variable                    Frequency  Percentage    Frequency  Percentage
T1 Race
  Non-Hispanic, White          141       20.52          181       19.36
  Minority                     643       79.48          754       80.64
T1 Gender
  Male                         388       50.16          469       50.05
  Female                       397       49.84          468       49.95
T1 Risk
  Low                          625       86.97          718       87.35
  High                          92       13.03          104       12.65
T1 AP Language
  English                      667       88.47          803       85.7
  Spanish                      118       11.53          134       14.3
T2 AP Language
  English                      730       93.93          876       93.49
  Spanish                       55        6.07           61        6.51
T1 LW Language
  English                      666       88.35          802       85.59
  Spanish                      119       11.65          135       14.41
T2 LW Language
  English                      728       93.72          874       93.28
  Spanish                       57        6.28           63        6.72

Note. Estimates were weighted using the P12WT variable and standard
errors are calculated for the weighted means. T = time; AP = Applied
Problems; LW = Letter-Word.


Missing data. Owing to concerns regarding potential bias
and/or representativeness of our sample, we examined different
sources of missingness using the MI function in SAS 9.3.
Although more than 1,300 children were included in the 4-year-old
cohort, in our empirical sample only 715 children had complete
data on all of the outcome variables related to mathematics, and
758 had complete data on the literacy outcomes. Sample sizes
varied due to the presence/absence of variables in the model. The
sources of missingness included missing outcome variables (math,
N = 229; literacy, N = 209), missing sampling weight variable
(math, N = 106; literacy, N = 110), and the combination of
missing covariates, variables of interest, and/or design variables
(math, N = 278; literacy, N = 274). These numbers were slightly
reduced again when we limited our sample to only children who
remained in the same classroom from fall to spring (math, N =
166; literacy, N = 176). Thus, our final empirical sample was N =
549 and N = 582 for models with math and literacy outcomes,
respectively.

In general, our largest source of missing data was missing
sampling weight variables and the combination of missing values
for other variables of interest and the sampling weight. The
FACES user guide notes that sampling weights provided with the
data account for selection into the sample, attrition over time, and
participant nonresponse (i.e., missing data at the instrument level;
U.S. Department of Health and Human Services, 2013). That is,
sampling weight variables have been adjusted for missingness
because of the individual’s chance of being selected and whether
their parents, teachers, or center directors responded on an
instrument in the FACES survey. Despite the fact that missingness was
adjusted with the sampling weight variable, we tested whether
demographic and child outcome variables from the first time point
(fall 2009) significantly predicted who had missing data at the
second (spring 2010) time point. The results from analyses using
SURVEYLOGISTIC in SAS 9.3 showed that none of the
demographic and child outcome variables in fall 2009 were significantly
related to whether or not children remained in the study in spring
2010. We used this same process to test for differences between
students who moved from one classroom to another during the
school year and found that none of the demographic and child
outcome variables from fall 2009 significantly predicted
differences between students who moved and those who did not.

Measures 

Academic skills. Children’s early math skills were measured
with the Applied Problems subtest from the Woodcock-Johnson
Tests of Achievement-III (WJ-III; Woodcock, McGrew, &
Mather, 2007) or the Problemas Aplicados subtest from the Batería
III: Woodcock-Muñoz (WM-III; Woodcock, Muñoz-Sandoval,
McGrew, & Mather, 2007). Applied Problems will be used to refer
to this subscale from both the WJ-III and the WM-III. This subtest
measures children’s ability to analyze information and solve
problems (typically using counting, addition, or subtraction).
Children’s early literacy skills were measured with the Letter-Word
subtest from the WJ-III or the Identificación de Letras y Palabras
from the WM-III. Letter-Word will be used to refer to this subscale
from both the WJ-III and the WM-III. This subtest measures
children’s ability to correctly label letters and words shown
individually on a book page. We used the growth or W-scores from
both the WJ-III and WM-III because the mean of the W-scores will
change as children progress, demonstrating growth and facilitating
comparisons over time as scores are linked to the same
developmental scale (U.S. Department of Health and Human Services,
2013). The FACES 2009 user’s manual (U.S. Department of
Health and Human Services, 2013) reports reliability for the
W-scores with 4-year-old children as α = .94 and α = .93 for the
Applied Problems subtest on the WJ-III and the WM-III,
respectively, and as α = .98 and α = .84 for the Letter-Word subtest on
the WJ-III and the WM-III, respectively.

To  evaluate  the  potential  for  differential  prediction  of  spring 
WJ-III  and  WM-III  scores,  we  tested  the  interaction  between 
academic  outcome  scores  (Applied  Problems/Letter-Word)  at 
Time  1  (fall)  and  assessment  language.  None  of  the  interactions 
were  statistically  significant,  indicating  that  scores  from  the  WJ 
and WM did not function differently here (highest B = -0.16,
SEb = 0.10, p = .13; Letter-Word).

Temperamental regulation. Temperamental regulation was
measured using observations of children’s behavior during direct
assessments via the Leiter-R Examiner Rating Scales (Roid &
Miller, 1997) that have been used previously to measure children’s
regulation (e.g., Chazan-Cohen et al., 2009). We used three
subscales¹: Attention (10 items; “Pays attention to details within
tasks”), Activity (4 items; “Focuses without fidgeting, restlessness,
or gazing elsewhere”), and Impulse Control (8 items; “Inhibits
verbalizations appropriately; does not ‘blurt out’”). Assessors rated
children from 0 (rarely/never) to 3 (usually/always) on each item.
Results from a large, randomized trial conducted with 2- to
4-year-olds (Olds et al., 2004) suggest that scores from these scales
demonstrate convergent evidence with scores for language
development and executive functioning; specifically, at-risk children
exposed to a home visitation intervention experienced significant
improvements in their behavior during testing (as measured using
the Leiter-R Examiner Rating Scales; ES = .38), as well as
language (ES = .31), and executive function (ES = .47). The
FACES user’s manual reports Cronbach’s alphas for these scales
ranging from 0.93-0.97 (U.S. Department of Health and Human
Services, 2013). The three subscales were highly correlated (all
rs > 0.80), so we created a composite factor score from the
subscales using Mplus version 7.0 (Muthén & Muthén,
1998-2012).² Factor scores are used as proxies for latent variables
because they are the scores that would have been observed if it
were possible to measure a latent factor directly (Brown, 2012).
This composite factor score was used in all subsequent analyses
(composite reliability ω = 0.96; Geldhof, Preacher, & Zyphur,
2014), and the variable is referred to as Temperamental Regulation
(weighted M = -0.25, SE = 0.23).

Classroom quality. The Classroom Assessment Scoring System
(CLASS; Pianta, LaParo, & Hamre, 2008) was used to evaluate
classroom quality. Observations were conducted in the spring
for at least 4 hr on one day (conducted in four 30-min cycles; 20
min of observation followed by 10 min of recording and coding).
Trained researchers conducted classroom observations using the
CLASS and were required to achieve agreement with master
coders on three videotapes (within 1 point on 80% of the CLASS
dimensions; Pianta et al., 2008) prior to conducting observations in
the field.

¹ Item-level information as well as interrater agreement, training, and
qualification of raters for the Leiter-R was not provided in the dataset or
manual.

² Mplus uses maximum a posteriori scoring to generate factor scores
(MAP; Embretson & Reise, 2000).

For the current study, we included domain scores for
Instructional Support, Emotional Support, and Classroom Organization.
Instructional Support (weighted M = 2.26, SE = 0.05) is
calculated as the average of three dimensions measuring language
modeling, quality of feedback, and concept development.
Emotional Support (weighted M = 5.26, SE = 0.06) is a composite
measure of four dimensions measuring classroom positive climate,
negative climate (reversed), teacher sensitivity, and regard for
student perspectives. Classroom Organization (weighted M =
4.68, SE = 0.05) comprises the average of three dimensions
measuring behavior management, classroom productivity, and
instructional learning formats. All dimensions were scored using a
scale from 1 (low) to 7 (high). The three CLASS domain variables
were moderately correlated with one another. The strongest
relationship was found between Emotional Support and Classroom
Organization (r = .70), whereas weaker relationships were found
between Instructional Support and Classroom Organization (r =
.49), and Instructional Support and Emotional Support (r =
.51). According to the FACES user guide, coefficient alphas
ranged from .79 for Instructional Support to .91 for Emotional
Support. Additionally, joint observations with master coders at the
beginning, middle, and end of the field data collection period
indicated that interrater agreement (within 1 point of master coder)
was high (95%-96%) across all time points (U.S. Department of
Health and Human Services, 2013). Instructional Support and
Emotional Support scores have been associated with preschoolers’
academic skills and behavior. For example, Instructional Support
was positively associated with language, literacy, and math skills
(βs ranged from .33 to .69) and Emotional Support was associated
with social competence (β = .06) and behavior problems
(β = -.05) in preschoolers (Mashburn et al., 2008). Organization
in kindergarten classrooms was associated with more self-control
(behavioral, β = .32; cognitive, β = .24) and positive work habits
(β = .22; Rimm-Kaufman et al., 2009).

Cumulative economic risk. Consistent with other studies of
cumulative risk (e.g., Appleyard et al., 2005; Crosnoe & Cooper,
2010; Evans, 2003), the Cumulative Economic Risk variable was
calculated as the simple sum of dichotomous (0 = no, 1 = yes)
scores for three indices of economic risk: single-parent household,
mother’s education less than high school diploma, and household
income below federal poverty threshold. Correlations among the
three risk indices ranged from r = -.10 to .19. Cumulative
Economic Risk scores were calculated in the fall of 2009 and
ranged from 0-3. The majority of children in the empirical sample
had either one or two risk factors (0 risk factors, 13%; one risk
factor, 34%; two risk factors, 40%; three risk factors, 13%). We
were particularly interested in understanding how children with the
highest level of Cumulative Economic Risk differed from other
children in the sample, so we divided children into two Cumulative
Economic Risk groups: (a) highest risk (i.e., children with three risk
factors; Applied Problems N = 63 and Letter-Word N = 70) and
(b) lower risk (children with zero, one, or two risk factors; Applied
Problems N = 455 and Letter-Word N = 479).
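
The scoring rule described above (sum three 0/1 indicators, then split
children into a highest-risk and a lower-risk group) can be written as a
short sketch. The function and argument names are hypothetical and are
not FACES variable names.

```python
def cumulative_risk(single_parent, mother_less_than_hs, below_poverty):
    """Sum three dichotomous (0 = no, 1 = yes) economic risk indicators."""
    indicators = (single_parent, mother_less_than_hs, below_poverty)
    return sum(1 if flag else 0 for flag in indicators)


def risk_group(score):
    """Return 'highest' for three risk factors and 'lower' for 0, 1, or 2."""
    if score not in (0, 1, 2, 3):
        raise ValueError("cumulative risk score must be 0, 1, 2, or 3")
    return "highest" if score == 3 else "lower"


if __name__ == "__main__":
    score = cumulative_risk(True, False, True)
    print(score, risk_group(score))
```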

Control variables. Child demographic variables (gender, age,
and race/ethnicity recoded as non-Hispanic White, and minority),³
fall assessment scores for Applied Problems or Letter-Word, and
assessment language (English or Spanish) were used as control
variables in analyses.

Data  Analysis 

Outcomes were children’s performance on math and literacy
assessments at the end of prekindergarten (spring 2010). Predictors
included the control variables, performance on the math or literacy
assessment at the beginning of prekindergarten (fall 2009),
Temperamental Regulation, classroom quality (Instructional Support,
Emotional Support, Organization), Cumulative Economic Risk,
and all two- and three-way interactions between Temperamental
Regulation, classroom quality, and Cumulative Economic Risk.
See the online supplemental material appendix for a general
expression of the statistical model used in this study. Before
modeling, we tested the data for the presence of heteroscedasticity and
multicollinearity. Given the complex data structure, traditional
methods for assessing model assumptions were not available, so
we used evidence from both survey and traditional approaches to
evaluate model assumptions (e.g., Lohr, 2012). Our results showed
no evidence of heteroscedasticity or multicollinearity.

The interactions of child Temperamental Regulation, classroom
quality, and Cumulative Economic Risk were examined in separate
regression models for each of the two outcomes (math and literacy
performance) using a top-down approach, which is an iterative,
exploratory model-building approach (Ryoo, 2011) in which we
started with the most complex model (i.e., all possible interactions
with the central variables of interest) and removed interactions that
were nonsignificant (p > .05). Each model always included the
control variables and variables of substantive interest (i.e.,
Temperamental Regulation, CLASS variables, Cumulative Economic
Risk), but only significant higher order interactions (three-way and
two-way interactions) were retained. We also tested the
interactions between each of the covariates with the central variables to
test the assumption that they do not interact with our variables of
interest. There were some significant interactions between our
control variables and central variables, so these significant
interactions were included in the final models. The final models were
the end result of this model building process.

Variables were centered to aid in parameter interpretation.
Continuous variables were centered at the sample average. Discrete
variables were coded so the reference group comprised
non-Hispanic, White males who completed assessments in English
and indicated lower cumulative economic risk. Given this
centering method, the estimated regression effects can be interpreted as
the change in the predicted outcome (math or literacy scores) for
every one-unit change in a particular variable, holding all other
variables constant at the centering point.

³ We evaluated two coding schemes for the race/ethnicity variable with
both of our models: dichotomous coding of non-Hispanic White, and
minority; and four categorical codes of the following: (a) non-Hispanic,
White; (b) Hispanic/Latino; (c) African American, non-Hispanic; and (d)
other. In both models, the more complicated categorical scheme did not
indicate significant differences among the four groups. Subsequently, we
used the dichotomous coding scheme.

Our main research question focused on whether classroom
quality moderates associations between temperament and children’s
early math and literacy skills at the two levels of Cumulative
Economic Risk (highest, three risk factors; lower, zero, one, or two
risk factors). We also examined associations between the lower
order effects (i.e., two-way interactions and main effects) and
achievement; however, these were interpreted only in the absence
of higher order effects. To facilitate interpretation of interactions
of Temperamental Regulation, classroom quality, and Cumulative
Economic Risk, we refer to high (1 SD > mean), medium (mean),
and low (1 SD < mean) levels of each continuous variable.
Cohen’s f² was used as a measure of local effect size following the
procedures described in Selya et al. (2012), with “small” effects
defined as f² > 0.02 and < .15, “medium” effects defined as f² >
0.15 and < .35, and “large” effects defined as f² > 0.35 (Cohen,
1992).
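
For reference, the local effect size for a single term compares the
variance explained by the full model with the variance explained when
that term is omitted. The notation below (A for all other predictors, B
for the term of interest) is assumed here for illustration and does not
appear in the text above.

```latex
f^2 = \frac{R^2_{AB} - R^2_{A}}{1 - R^2_{AB}}
```

Here R²_AB is the proportion of variance explained by the model
containing every predictor, and R²_A is the proportion explained by the
same model without B. With hypothetical values R²_AB = .50 and
R²_A = .49, f² = .01 / .50 = .02, which sits at the lower bound of a
"small" effect.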

We conducted post hoc simple slopes tests to probe the
significant three-way interactions and controlled for familywise Type I
error by risk category using the Bonferroni-Holm procedure
(Holm, 1979). Because the main analysis showed predicted values
for children’s math and literacy skills at the mean value of each
CLASS variable (e.g., Instructional Support), simple slopes tests
were conducted at high (1 SD > mean) and low (1 SD < mean)
levels of the CLASS variables for the highest and lower risk
groups. The p values for the slopes within each risk category were
ordered from smallest to largest and alpha (.05) was then divided
by 3, 2, and 1, respectively, to obtain adjusted p values. Starting
with the smallest p value, the adjusted p values (pBH) were
sequentially compared to .05 until the first nonsignificant test (i.e.,
pBH > .05) was identified, and the remaining ones were declared
nonsignificant.
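
A minimal sketch of the Holm step-down adjustment follows. It uses the
equivalent formulation in which each ordered p value is multiplied by the
number of tests remaining (3, 2, 1 for three tests) and compared with
.05, rather than dividing alpha. The function names and the p values in
the example are hypothetical.

```python
def holm_adjust(p_values):
    """Return Holm-adjusted p values in the original input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # Multiply by the number of tests not yet examined, capped at 1.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # A larger p value can never receive a smaller adjusted value, so
        # every test after the first nonsignificant one stays nonsignificant.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


def holm_reject(p_values, alpha=0.05):
    """Flag which tests remain significant after the Holm adjustment."""
    return [p <= alpha for p in holm_adjust(p_values)]


if __name__ == "__main__":
    example = [0.04, 0.01, 0.03]  # hypothetical simple-slope p values
    print(holm_adjust(example))
    print(holm_reject(example))
```

In the hypothetical example, only the smallest p value survives the
adjustment; the second-smallest fails, so the largest is also declared
nonsignificant even though its raw value is below .05.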

Results 

Classroom  Quality,  Cumulative  Economic  Risk, 
Temperamental  Regulation,  and  Children’s  Skills 

Applied Problems. The left side of Table 2 provides the
final model and the unstandardized parameter estimates for
Applied Problems. The results showed a significant three-way
interaction between Temperamental Regulation, Instructional
Support, and Cumulative Economic Risk (B = 3.85, SEb = 1.11,
p < .01, f² = .01). Simple slopes analyses were conducted at high
and low values of Instructional Support (one unweighted standard
deviation above and below the mean) within each of the
Cumulative Economic Risk categories (Aiken & West, 1991). All simple
slopes are reported as standardized values to coincide with the
figures. With the highest Cumulative Economic Risk group, the
simple slopes of Temperamental Regulation on Applied Problems
scores were -0.18 (pBH < .01) and 0.26 (pBH < .01) at low and
high values of Instructional Support, respectively. With the lower
Cumulative Economic Risk group, the simple slopes of
Temperamental Regulation on Applied Problems scores were 0.01
(pBH > .05) and 0.12 (pBH < .01) at low and high values of
Instructional Support, respectively.

Table 2
Estimated Regression Coefficients for Spring (T2) Applied Problems and Letter-Word Scores

                                                         Applied Problems                 Letter-Word
Parameter                                          Estimate   SE     p     f²     Estimate   SE     p     f²
Intercept                                           402.21   1.75  <.01            333.06   2.46  <.01
T1 Race (Minority)                                   -4.58   1.52  <.01   .01       -3.35   2.48   .18   .00
T1 Risk (Highest)                                     2.19   1.83   .24   .00        3.61   3.55   .31   .00
T1 Gender (Female)                                    2.38   1.39   .09   .01       -0.24   1.75   .89   .00
T1 Age                                                0.56   0.19   .01   .02        0.66   0.20  <.01   .02
T1 Assessment Score                                   0.49   0.04  <.01   .36        0.63   0.04  <.01   .44
T1 Assessment Language (Spanish)                    -12.52   3.82  <.01   .04        4.92   3.55   .17   .00
T2 Assessment Language (Spanish)                      2.23   4.76   .64   .00       -3.32   3.78   .38   .00
T2 Child Temperament                                  1.59   0.45  <.01   .03        1.88   0.73   .01   .01
T2 Classroom Organization                             1.81   1.92   .35   .00       -1.25   2.53   .62   .00
T2 Instructional Support                              0.22   1.91   .91   .00       -0.47   2.84   .87   .00
T2 Emotional Support                                 -2.77   2.21   .22   .00        0.26   2.10   .90   .00
T1 Assessment Language × Emotional Supportᵃ            --     --    --    --       -17.70   5.44  <.01   .02
T2 Assessment Language × Classroom Organizationᵃ     -9.21   4.22  <.01   .01       11.81   5.21   .03   .00
Race × Instructional Supportᵃ                          --     --    --    --         8.43   3.24   .01   .01
Temperament × Classroom Organization                   --     --    --    --         0.10   1.09   .93   .00
Temperament × Instructional Support                   1.37   0.61   .03   .01         --     --    --    --
Temperament × Emotional Support                        --     --    --    --        -0.50   0.84   .55   .00
Risk × Classroom Organization                          --     --    --    --         8.47   5.21   .11   .00
Risk × Instructional Support                         -3.24   3.67   .38   .00         --     --    --    --
Risk × Emotional Support                               --     --    --    --       -12.88   5.69   .03   .01
Temperament × Risk                                   -0.63   0.74   .40   .00        1.65   1.41   .25   .00
Temperament × Classroom Organization × Risk            --     --    --    --         4.88   1.92   .01   .01
Temperament × Instructional Support × Risk            3.85   1.11  <.01   .01         --     --    --    --
Temperament × Emotional Support × Risk                 --     --    --    --        -5.45   2.16   .02   .00

Note. Estimates were weighted using the P12WT variable; continuous variables were mean centered. The reference group for categorical variables was
non-Hispanic, White males who completed assessments in English and had lower cumulative economic risk. “Small” effects are defined as f² > .02 to < .15,
“medium” effects defined as f² > .15 to < .35, and “large” effects defined as f² > .35 (Cohen, 1992). Boldface type indicates coefficients statistically
significant at p < .05. T = time.

ᵃ Interactions with control variables were included in the models if preliminary tests were statistically significant, but these interactions are not interpreted
because they are not of substantive interest.

Figure 1 shows the predicted Applied Problems scores for lower
and highest-risk students. Holding all other variables constant at
their centering point, highest-risk students in classrooms with low
Instructional Support were predicted to score lower on the Applied
Problems assessment as their Temperamental Regulation
increased. In contrast, highest-risk students in classrooms with high
Instructional Support were predicted to score higher on the
Applied Problems assessment as Temperamental Regulation
increased. Among the lower risk students, those in classrooms with
high Instructional Support were predicted to score higher on the
Applied Problems assessment as Temperamental Regulation
increased.

Letter-Word. The right side of Table 2 provides the final
model and the unstandardized parameter estimates corresponding
to the individual variables and interactions for Letter-Word. Our
results showed a significant three-way interaction among Temperamental
Regulation, Classroom Organization, and Cumulative Economic Risk
(B = 4.88, SEb = 1.92, p < .05, f² = .01). When risk was highest,
the simple slopes of Temperamental Regulation were -0.04
(pBH > .05) and 0.21 (pBH < .05) at low and high values of
Classroom Organization, respectively. When risk was lower, the
simple slopes of Temperamental Regulation were 0.04 (pBH > .05)
and 0.05 (pBH > .05) at low and high values of Classroom
Organization, respectively. As Figure 2 shows, holding all other
variables constant at their centering point, among the highest-risk
students, those in classrooms with high Organization were predicted
to perform better on the Letter-Word assessment as their
Temperamental Regulation increased.
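
Simple slopes of this kind follow from the interaction terms of a regression model. The sketch below is not the authors' analysis code; it only illustrates the standard formula for the slope of a focal predictor (here Temperamental Regulation) at fixed values of a mean-centered moderator and a 0/1 risk indicator. The function name and all coefficient values are hypothetical placeholders and are not taken from Table 2.

```python
def simple_slope(b_tr, b_tr_x_mod, b_tr_x_risk, b_three_way, moderator, risk):
    """Slope of the outcome on Temperamental Regulation (TR) at a fixed
    value of a mean-centered moderator and a 0/1 risk indicator."""
    return (b_tr
            + b_tr_x_mod * moderator
            + b_tr_x_risk * risk
            + b_three_way * moderator * risk)


# Hypothetical coefficients for illustration only (NOT from Table 2).
coefs = dict(b_tr=0.05, b_tr_x_mod=0.02, b_tr_x_risk=0.03, b_three_way=0.20)

# Evaluate the TR slope at low (-1 SD) and high (+1 SD) moderator values
# for the lower risk (0) and highest-risk (1) groups.
for risk in (0, 1):
    for label, mod in (("low", -1.0), ("high", 1.0)):
        slope = simple_slope(moderator=mod, risk=risk, **coefs)
        print(f"risk={risk} moderator={label}: slope={slope:+.2f}")
```

With a positive three-way coefficient, as in this sketch, the gap between the low- and high-moderator slopes is wider in the risk = 1 group, which is the pattern the simple-slope tests above probe.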

There  was  also  a  significant  three-way  interaction  between 
Temperamental  Regulation,  Emotional  Support,  and  Cumulative 
Economic Risk (B = -5.45, SEb = 2.16, p < .05, f² = .00).
However, because of the negligible f² value, we did not interpret
this  interaction. 
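
As a small illustration of how f² values map onto the conventional labels, the sketch below applies the Cohen (1992) cutoffs cited in the table note (.02, .15, .35). The function name and the "negligible" label for values below .02 are assumptions made for this example, and reported f² values are rounded to two decimals, so classifications near a cutoff are approximate.

```python
def classify_f2(f2):
    """Label an f-squared effect size using Cohen's (1992) cutoffs.

    The "negligible" label for values below .02 is an assumption of this
    sketch, not a category defined by Cohen.
    """
    if f2 >= 0.35:
        return "large"
    if f2 >= 0.15:
        return "medium"
    if f2 >= 0.02:
        return "small"
    return "negligible"


# Illustrative values only.
for value in (0.00, 0.02, 0.15, 0.35):
    print(f"f2 = {value:.2f}: {classify_f2(value)}")
```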

Discussion 

Our  examination  of  Head  Start  classroom  quality  as  a  moderator 
between children’s temperamental regulation and their early math
and literacy skills across varying levels of cumulative economic
risk  produced  three  main  findings.  First,  for  children  in  both  risk 
groups  (highest  and  lower  risk),  more  instructional  support  was 
associated  with  better  math  performance  for  those  with  high  levels 
of  temperamental  regulation,  but  poorer  performance  for  those 
with  low  temperamental  regulation.  Second,  among  highest  risk 
children,  low  instructional  support  was  protective  for  those  with 
low  temperamental  regulation  and  detrimental  for  those  with  high 
temperamental  regulation.  Third,  for  highest  risk  children,  high 
classroom organization predicted higher literacy scores for children
with high temperamental regulation. Children with low temperamental
regulation were expected to perform about the same,
regardless  of  the  level  of  classroom  organization.  Each  of  these 
findings  will  be  discussed  in  turn. 

Instructional  Support  as  Protective 

Classroom instructional support emerged as a moderator between
children’s temperamental regulation and math performance
(see Figure 1). For lower and highest-risk children, results indicate
that high instructional support may function as a protective factor
for math performance of children with higher temperamental regulation,
but may be detrimental for children with lower temperamental
regulation. Also, for highest risk children, low instructional
support appears to be detrimental for children with high temperamental
regulation but protective for those with low temperamental
regulation.

Although not fully congruent with our hypotheses, these findings
are similar to those from a study of Head Start children by
Dominguez et al. (2011) where children with more problems in
teacher interactions (and perhaps similar to children with low
temperamental regulation in the current study) had lower levels of
approaches to learning (ATL) skills in high instructional support
classrooms, whereas in low instructional support classrooms, problems
in teacher interactions were unrelated to ATL skills. Our
findings are also related to results from Vitiello et al. (2012) where
resilient children (similar to children characterized by higher
temperamental regulation in the current study) benefitted more from
high instructional support than their over- or undercontrolled peers
(similar to our low temperamental regulation children). Like both
Dominguez et al. (2011) and Vitiello et al. (2012), our results
suggest that higher quality classrooms may be variably beneficial,
depending on children’s characteristics.

Figure 1. Predicted values in standard deviation units for lower risk students (left image) and higher risk
students (right image) for Applied Problems assessment scores based on temperamental regulation (TR) and
preschool classroom instructional support (IS), holding all other variables constant at their centering point. High,
medium, and low values were based on the unweighted standard deviation (1 SD above, at the mean, and 1 SD
below the mean).

Figure 2. Predicted values in standard deviation units for lower risk students (left image) and higher risk
students (right image) for Letter-Word assessment scores based on temperamental regulation (TR) and preschool
classroom organization (CO), holding all other variables constant at their centering point. High, medium, and low
values were based on the unweighted standard deviation (1 SD above, at the mean, and 1 SD below the mean).

Results reported here suggest that higher quality teacher-child
interactions, particularly instructional support, are likely to promote
greater math skill development but perhaps only when children
have the regulatory skills necessary to benefit from a more
intensive and demanding level of instruction (see Burchinal,
Howes, et al., 2008; Mashburn et al., 2008; Pianta et al., 2008). In
the current study, this was especially true for the highest risk
children: if they had low temperamental regulation, their math
performance was expected to decrease as instructional support
increased. Some studies indicating that instructional support promotes
positive outcomes for all children have included economically
disadvantaged samples (e.g., Mashburn et al., 2008). However,
studies showing that instructional support can be protective
for acquisition of academic and social skills among children at-risk
due to demographic or temperament characteristics have included
samples with relatively low levels of economic risk (e.g., Curby et
al., 2011; Hamre & Pianta, 2005). It appears that the early math
skill development of children with the highest level of cumulative
economic risk is more dependent on their abilities to regulate than
is that of children from lower risk homes. Together with our finding that
high instructional support was protective only for math skills of
children with high temperamental regulation (and detrimental for
children with low temperamental regulation), our results suggest
that prioritizing interventions targeting children’s regulatory
development may be particularly beneficial for facilitating learning
in children with more cumulative economic risk.

The fact that high instructional support appeared detrimental for
children with less temperamental regulation among the highest risk
group may also be understood in the context of the instructional
support scores obtained across classrooms in this study. As is
typical in classroom quality research (e.g., Mashburn et al., 2008),
the average level of instructional support in this study was very
low (unweighted M = 2.27). Even classrooms with “high” instructional
support (2.87 in this sample; 1 SD above the unweighted M)
had scores falling short of a classification as “moderately
instructionally supportive” as defined in the CLASS manual (Pianta et al.,
2008). Thus, the generally low levels of instructional support in
this sample may have masked some associations that would perhaps
have emerged in a sample with greater variability. Likewise,
it could be that observed levels of instructional support in this
study were not high enough to impact the math performance of
children with low temperamental regulation. Indeed, Burchinal,
Vandergrift, Pianta, and Mashburn (2010) identified 3.25 as a
threshold score for instructional support on the scale used here,
such that effects of instructional support on children’s academic
skills were larger in classrooms of moderate or high instructional
support (i.e., 3.25 or above) compared with classrooms with low
instructional support (i.e., under 3.25), a category that would
have included all of the classrooms in our data set.
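
To make the distance between this sample and the threshold concrete, the following back-of-the-envelope sketch uses only the three figures reported above (mean 2.27, "high" value 2.87, threshold 3.25). It assumes that the reported values are exact rather than rounded and that "high" is exactly 1 SD above the mean; the variable names are illustrative.

```python
mean_is = 2.27    # unweighted sample mean of instructional support (from the text)
high_is = 2.87    # "high" value, defined as 1 SD above the unweighted mean
threshold = 3.25  # threshold reported by Burchinal et al. (2010)

# Implied standard deviation, assuming the reported values are exact.
sd_is = high_is - mean_is

# Position of the threshold relative to the sample mean, in SD units.
z_threshold = (threshold - mean_is) / sd_is

print(f"implied SD = {sd_is:.2f}")
print(f"threshold lies {z_threshold:.2f} SD above the sample mean")
```

Under these assumptions the threshold sits roughly 1.6 SD above the sample mean, so even the "high" instructional support value used for plotting falls well below it.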

Classroom  Organization,  Temperamental  Regulation, 
and  Cumulative  Economic  Risk 

Classroom organization moderated the association between temperamental
regulation and literacy performance but only for children
with highest cumulative economic risk. That is, similar to
findings for instructional support, we found that high classroom
organization appeared protective for children with highest economic
risk but only if they also had high temperamental regulation
(see Figure 2). On the other hand, children with low temperamental
regulation were predicted to perform about the same, regardless
of the level of organization in their classroom. These results
indicate that the benefits of classroom organization may matter
more for children with already high levels of regulation, especially
in situations where children face elevated cumulative economic
risk. Although classroom organization is intended to facilitate
children’s self-regulation development and, in turn, promote academic
success (Downer et al., 2010), findings from the current
study suggest that training children’s regulation skills may be
necessary for optimizing high-risk children’s early academic outcomes
(O’Connor et al., 2014).

Limitations 

This study has several limitations that warrant mention. First, in
any longitudinal study, particularly using data from multiple
sources, data are likely to be missing due to attrition and participant
nonresponse. This is even more likely when a study includes
a large proportion of children at-risk due to economic disadvantage,
as was the case here. Although a number of children in the
original sample had missing data, the use of sampling weights
mitigates concerns regarding generalizability and some nonresponse
bias. Second, the correlational nature of this study prevents
any strong conclusions about causal relationships between children’s
temperamental regulation, cumulative economic risk, classroom
quality, and academic performance. There is a need for
experimental work that may more precisely tease apart the mechanisms
by which certain classroom processes and regulation-focused
interventions may be effective for improving children’s
learning. Third, effect sizes from this study are small, as is typical
of classroom-based research (e.g., Bulotsky-Shearer, Dominguez,
& Bell, 2012; Hamre & Pianta, 2005; Reyes, Brackett, Rivers,
White, & Salovey, 2012) as well as of studies of specific subsets
of children from large representative samples (e.g., Adelson,
McCoach, & Gavin, 2012). Fourth, although our measure of cumulative
economic risk included the most robust indicators of economic
disadvantage (i.e., income below poverty and mothers’
education below high school; Crosnoe & Cooper, 2010), research
suggests that there are other indicators related to economic risk,
such as food insufficiency or household density, that may add
meaningfully to our understanding of the impact of cumulative risk
on children’s outcomes (e.g., Burchinal, Vernon-Feagans, Cox, &
Key Family Life Project Investigators, 2008). Fifth, we were
unable to measure the density of high cumulative economic risk as
a classroom-level variable because only 10 children per Head Start
classroom were included in the study. Sixth, our measure of
temperament was restricted to regulation; future research in this
area should include indices of temperamental reactivity, especially
considering the emerging evidence of the importance of these
reactive components (e.g., shyness, anger) to children’s academic
and social success in early childhood classrooms (e.g., Justice,
Cottone, Mashburn, & Rimm-Kaufman, 2008; Vitiello et al.,
2012). Seventh, classroom observations took place on one day in
the spring; it is possible that classroom quality varied more than
this sampling strategy revealed. However, multiple studies using
the CLASS to assess classroom quality in a single day have found
consistent results (Araujo et al., 2014; Burchinal, Howes, et al.,
2008; Leyva et al., 2015; Mashburn et al., 2008). Although the
single-day observation methodology has limits, evidence from the
CLASS manual (Pianta et al., 2008) suggests that there is consistency
with observed teacher behavior that can be accounted for by
conducting multiple observation cycles on one day and averaging
the score, as was done here. Finally, there was a large difference
in sample size between the highest and lower risk groups. These
sample size differences can lead to some instability in the model
due to the small group size of children in the highest-risk category
(<100).

Implications 

Findings reported here have implications for research and practice.
Our results most clearly point to the importance of considering
cumulative economic risk when examining the potential effects
of classroom quality, teacher behavior, and interventions with
at-risk children. Building on work showing that accumulated risk
factors predict adjustment outcomes (e.g., Burchinal et al., 2008;
Crosnoe & Cooper, 2010; Sameroff et al., 1998), our results
suggest that the accumulation of several key economic risk factors
(i.e., income below the poverty cutoff, low parent education levels,
single-parent family status) creates a level of stress that may
prevent children from accessing or benefitting from features of
high-quality classrooms that have appeared protective for other
children. Although a relatively narrow intervention focus on classroom
quality may be attractive because that environment may be
more amenable to intervention efforts, our results suggest that the
effectiveness of a more supportive environment may depend on
children’s regulatory abilities when considered in the context of
cumulative economic risk. It may also be more effective to incorporate
home and parenting environments into intervention protocols.
Work by Crosnoe et al. (2010) shows that high-quality
interactions in both the home and childcare or school settings were
necessary to mitigate reading deficits for low-income children.

In terms of practice, our results complement other work showing
the value of children’s regulation for their lifelong success
(Bierman et al., 2008; McClelland et al., 2007; Moffitt, Poulton, &
Caspi, 2013). Here, children with high temperamental regulation
were predicted to perform better on math and literacy assessments
than their less regulated peers when in classrooms with higher
levels of instructional support and organization. This indicates
that, in addition to efforts to improve the quality of interactions
between children and teachers for promoting positive childhood
outcomes, a focus on building children’s regulatory skills, particularly
for children at the highest levels of cumulative risk, may be
a promising avenue for interventions designed to narrow the income
achievement gap.
This study explored associations between students’ perceptions of challenge, teacher-provided support
and  obstruction,  and  students’  momentary  academic  engagement  in  high  school  science  classrooms. 
Instrumental  and  emotional  dimensions  of  support  and  obstruction  were  examined  separately,  and 
analyses  tested  whether  the  relationship  between  challenge  and  engagement  was  moderated  by  teacher 
support,  teacher  obstruction,  and  individual  characteristics  like  gender  and  grade  level.  Students’ 
perceptions  of  challenge  were  positively  related  to  their  momentary  reports  of  engagement  in  science 
learning  activities,  while  teachers’  instrumental  support  was  positively  associated  with  engagement 
across  all  levels  of  perceived  challenge.  Even  though  teachers’  provision  of  emotional  support  was  not 
predictive  of  student  engagement,  teachers’  emotional  obstruction  was  negatively  associated  with  student 
engagement.  Teachers’  instrumental  obstruction  had  less  consistent  associations  with  student  engage¬ 
ment,  and  was  only  associated  with  declines  in  engagement  during  those  moments  when  students 
perceived  greater  challenge  in  class.  Both  gender  and  grade  level  emerged  as  moderators  of  the 
relationship  between  challenge  and  engagement.  Results  are  discussed  in  terms  of  implications  for  future 
research  and  instructional  practice. 

Keywords:  teacher  support,  teacher  obstruction,  academic  challenge,  student  engagement,  science 
education 


By both student and researcher/observer reports, student engagement
in high school science classrooms tends to be low
(Shumow & Schmidt, 2014), despite the tremendous potential for
engagement that exists through hands-on activities, novelty, and
discovery. Although motivation and engagement drop throughout
high school in most subject areas, the drop tends to be sharper for
science than for any other subject (Gottfried, Fleming, & Gottfried,
2001). Researchers have speculated that science instruction often
fails to meet its potential for student engagement because, despite
long-standing advocacy for experiential and inquiry-based learning
approaches, lecture continues to be the predominant mode of
instruction in high school science classrooms (McLaughlin &
Talbert, 2001; Nolen, 2003; Shernoff, Knauth, & Makris, 2000;
Shumow & Schmidt, 2014; Tobin, 1987; Tobin & Gallagher,
1987). Given that science, technology, engineering, and mathematics
(STEM) education is a national priority in the United
States, understanding how to engage students in science is critically
important. Motivational variables, such as perceptions of
competence and task value, have emerged as more predictive of
high school students’ situational engagement in science than their
cognitive abilities in science (Lau & Roeser, 2002, 2008). This
research underscores the importance of understanding factors beyond
the development of specific cognitive skills in improving
science education.

Researchers  suggest  that  engagement  is  important  for  sustaining 
academic  resilience  (Finn  &  Rock,  1997),  which  in  turn  enhances 
achievement  (Appleton,  Christenson,  &  Furlong,  2008;  Connell, 
Spencer, & Aber, 1994; Shernoff & Schmidt, 2008; Sirin &
Rogers-Sirin, 2004; Skinner, Wellborn, & Connell, 1990). Similarly,
high engagement is related to positive school experiences,
which  lead  to  optimal  human  development  (Csikszentmihalyi, 
1990;  Larson,  2000). 

According  to  a  national  study,  one  of  the  main  reasons  students 
provide  for  dropping  out  of  high  school  is  a  lack  of  positive  and 
meaningful  relationships  with  adults  in  school.  Students  from  this 
same study also indicated a desire to be more intellectually challenged
at school (Yazzie-Mintz, 2010). Despite the plethora of
studies  on  effective  instructional  behaviors,  few  have  examined  the 
direct  and  interactive  roles  of  students’  perception  of  challenge  and 
teachers’  provision  of  support  as  shapers  of  high  school  students’ 
engagement.  Thus,  these  two  constructs — challenge  and  support — 
need  to  be  considered  in  tandem  to  determine  their  unique  roles  in 
promoting  student  engagement. 

The  purpose  of  this  study  is  to  explore  the  associations  between 
students’  perception  of  challenge,  teacher-provided  support  and 
obstruction, and students’ momentary academic engagement in
high  school  science  classrooms.  Specifically,  the  study  examines 
the  following  research  questions:  (a)  To  what  degree  do  students 
experience  challenge,  teacher  support,  and  teacher  obstruction  in 
their  science  classes?;  (b)  Are  students’  perceptions  of  challenge 
and  teachers’  supportive  and  obstructive  behaviors  associated  with 
high  school  students’  momentary  engagement  in  science?;  and  (c) 
Are  the  associations  specified  in  the  second  research  question 
moderated  by  student  gender  or  grade  level? 

Defining  Student  Engagement 

Following recent calls to conceptualize engagement as a multidimensional
construct (e.g., Fredricks, Blumenfeld, & Paris, 2004),
we define engagement in the current study as consisting of both
cognitive (e.g., concentration, effort in an activity) and affective
(e.g., enjoyment, interest) elements. We conceptualized engagement
in accord with Csikszentmihalyi’s (1990) Emergent Motivation
Theory, which posits that engagement consists of simultaneous
experiences of concentration and cognitive investment, as
well as interest and enjoyment in a task.

A central tenet of Emergent Motivation Theory is that motivational
processes emerge as a function of the context, through
students interacting within the social and academic environment
(Csikszentmihalyi, 1990; Csikszentmihalyi & Larson, 1984). Student
engagement is thus viewed as variable and malleable, rather
than as a trait-like personal characteristic (Urdan & Schoenfelder,
2006). In classroom environments, the teacher plays a prominent
role in creating the conditions for engagement and learning to
occur. This is evident through the choices the teacher makes
regarding the instructional behaviors he or she engages in during
the lesson (i.e., support and obstruction), and through the degree to
which the chosen content challenges students to engage. As such,
teacher instructional behaviors can foster or hinder student engagement
at multiple time points throughout a lesson. Thus, it is
important to identify and understand those classroom practices and
conditions that are proximally associated with students’ engagement.

Perceived  Challenge 

Challenge  is  widely  viewed  as  critical  for  student  engagement 
and  achievement  (Alexander,  Entwisle,  &  Horsey,  1997;  Planty  et 
al., 2009). Research highlighting student disengagement in modern
schools identifies challenge as a possible antidote to this problem
(Shernoff et al., 2003; Yazzie-Mintz, 2010). In the present study,
we  defined  challenge  as  the  perception  by  a  student  that  the 
activity  at  hand  calls  for  some  type  of  cognitive  or  physical 
investment  (Csikszentmihalyi,  1990;  Shumow  &  Schmidt,  2014; 
see  also  Webster’s  New  World  College  Dictionary,  n.d.).  The 
experience  of  challenge  is  subjective:  whether  a  student  perceives 
an  activity  as  challenging  depends,  to  a  large  extent,  on  his  or  her 
actual  and  perceived  level  of  skill.  As  such,  an  activity  that  is 
perceived  as  challenging  to  one  student  may  not  be  challenging  to 
another. 

Generally speaking, researchers have found that perceived challenge
is positively associated with students’ engagement
(Fredricks, Blumenfeld, Friedel, & Paris, 2002; Lutz, Guthrie, &
Davis, 2006; Hektner, 2001; Shernoff et al., 2003). However, the
way students interpret challenge may vary from one learning
context to another. Students may be motivated by challenge in
certain types of learning contexts, but may feel threatened by it in
others (Csikszentmihalyi, 1990; Schmidt, Kackar, & Strati, 2010).
For instance, when students are challenged in a learning context
that is characterized by supportive teacher instructional behaviors,
they may be motivated by the challenge, leading to higher engagement.
Conversely, we might expect that when students experience
academic challenge in contexts where instructional behaviors are
obstructive, they may feel threatened, rather than motivated, by
challenge, and would disengage (Rathunde, 1996). When examining
the role of challenge for students, it is important to consider
features of the context, such as support and obstruction, in which
challenge is experienced.

Likewise, there may be individual differences in students’ interpretation
of challenge as motivating or threatening, and these
individual differences may also impact the relationship between
challenge and engagement. For example, females report greater
science anxiety than males (Britner, 2008; Mallow, 2010), and this
negative emotional experience may dampen the motivational properties
of challenge. Thus, we examined gender as a potential
moderator of the relationship between challenge and engagement.

For a variety of reasons, grade in school may also be an
important factor. First, the transition to high school is often fraught
with academic difficulty, loss of motivation, and frustration (Barber
& Olsen, 2004; Benner & Graham, 2009; Eccles & Roeser,
2009). As a result, 9th graders may perceive challenge as more
threatening than older students who have had the chance to acclimate
to the high school environment. Second, among students with
college aspirations, 11th grade is often seen as a critical year, as it
is the last chance to shine on college transcripts, often followed by
a “senior slump” in 12th grade (Kirst, 2001). Thus, we might
expect 11th graders to respond to challenge with deeper engagement
in an effort to put their best foot forward in the college
application process.

Teacher  Support  and  Obstruction 

Studied under many labels (i.e., pedagogical caring, Wentzel,
1998; relatedness, Deci & Ryan, 2000; classroom belonging,
Goodenow, 1993), researchers have used a wide and inconsistent
range of frameworks to characterize the intellectual and emotional
bonds that exist between teachers and students (Davis, 2003).
Consistent with recent work that views teacher support as malleable
(Gehlbach, Brinkworth, & Harris, 2012) and multidimensional
(Anderman, Andrzejewski, & Allen, 2011), we view teacher support
as consisting of both instrumental and emotional dimensions
(Federici & Skaalvik, 2014; Malecki & Demaray, 2003; Semmer,
Elfering, Jacobshagen, Perrot, Beehr, & Boos, 2008). Additionally,
we conceptualize obstruction, the antithesis of support, as having
instrumental and emotional dimensions as well.

Instrumental  Support 

Instrumental support refers to teacher scaffolding that helps
students progress with the academic task at hand (e.g., using
structured questions, providing appropriate materials and feedback).
Empirical studies considering the role of instrumental support
on student engagement are scant. However, researchers have
started to recognize the differential role that this type of support
might play in shaping students’ academic beliefs and behaviors
(Federici & Skaalvik, 2014; Malecki & Demaray, 2003; Suldo,
Friedrich, White, Farmer, Minch, & Michalowski, 2009). For
example, instrumental support defined as students’ perception of
tangible support (e.g., “Teachers take time to help me learn to do
something well.”) is related to students’ reported subjective well-being
(Suldo et al., 2009), intrinsic motivation, and help-seeking
behavior (Federici & Skaalvik, 2014).

Emotional  Support 

Emotional support refers to the degree to which the teacher
encourages, accepts, respects, and trusts students, as well as the
degree to which he or she demonstrates caring for students’ emotional
well-being and conveys confidence in their abilities to fulfill
classroom requirements successfully. There is compelling empirical
evidence of the positive relationship between teacher emotional
support and students’ academic engagement (Brewster &
Bowen, 2004; Davidson & Phelan, 1999; Garcia-Reid, 2007;
Green, Rhodes, Hirsch, Suarez-Orozco, & Camic, 2008; Hughes &
Kwok, 2007; Marks, 2000; Murdock, 1999; Murray, 2009; Patrick,
Ryan, & Kaplan, 2007; Ryan & Patrick, 2001; Ryan, Stiller, &
Lynch, 1994; Sharkey, You, & Schnoebelen, 2008). These positive
relationships have been observed in research involving both students’
and teachers’ reports of support (Klem & Connell, 2004). In
longitudinal studies, children’s reports of relatedness in the classroom
and security with teachers are associated with changes in
engagement over a number of years (Furrer & Skinner, 2003;
Tucker et al., 2002). Some have noted a bidirectional relationship
where engaged students receive more support from teachers and
supportive teachers have more engaged students (Connell & Wellborn,
1991; Furrer & Skinner, 2009; Skinner & Belmont, 1993).
For reviews of this literature, the reader is referred to Martin and
Dowson (2009), Osterman (2000), or Pekrun and Linnenbrink-Garcia
(2012). It is noteworthy that the vast majority of research
on teacher support (whether it be emotional or instrumental) has
focused on elementary school populations. A unique contribution
of the present study is to examine these constructs in the context of
high school science classrooms.

Instrumental  Obstruction 

Through  their  behaviors,  teachers  can,  at  times,  also  create 
barriers  to  student  learning  and  engagement,  rather  than  providing 
support  (Suldo  et  al.,  2009).  In  the  present  study,  obstruction 
represents  more  than  the  mere  absence  of  overt  support.  Instead,  it 
refers  to  specific  teacher  instructional  behaviors  that  directly 
thwart  students  instrumentally  or  emotionally  (Turner  &  Patrick, 
2004).  It  is  worth  noting  that  teacher  behaviors  that  are  potentially 
thwarting to students are often not intended as such by the
teachers. Obstruction can result from relatively well-intentioned teacher
behaviors  (such  as  sarcastic  teasing)  or  can  stem  from  the  realities 
of  teachers’  limited  attentional  resources  that  often  translate  into 
neglect  of  some  students. 

In this sense, instrumental obstruction refers to teacher
instructional behaviors that demonstrate a tangible undermining of
students’ efforts, or a failure to respond to obvious student bids for
assistance in situations where help is clearly warranted. This
definition of teacher obstruction includes both purposeful
undermining and overt neglect. The few existing empirical studies
addressing instrumental obstruction tie negative teacher behaviors
to maladaptive student outcomes (Birch & Ladd, 1997;
Blankemeyer, Flannery, & Vazsonyi, 2002; Kesner, 2000; Walsh, 2002)
and specifically to decreases in students’ motivation (Meyer &
Turner, 2002; Skinner & Belmont, 1993). Instrumentally
obstructive behaviors were more frequently observed among teachers
who were characterized as humorless, boring, self-centered,
unavailable, and lacking knowledge or competence (Gorham &
Christophel, 1992). Furthermore, students who perceived their
teachers as frequently providing negative feedback reported a more
negative relationship with them (Burnett, 2002).

Emotional  Obstruction 

Emotional obstruction refers to teacher disregard, disrespect,
sarcasm, criticism, threats, and negative affect toward students.
While researchers have documented the positive associations
between emotionally supportive teacher behaviors and student
engagement, they have largely ignored the role of emotionally
obstructive behaviors. One reason for this might be that we would
expect emotional obstruction to occur very rarely. Although it is
fairly easy to imagine that, because of multiple demands, teachers
might unintentionally be instrumentally obstructive by neglecting
students or otherwise creating obstacles to their academic progress,
teachers’ emotional obstruction seems far less likely to occur.
However, researchers have found that students tend to remember
teachers’ negative behaviors more vividly than their positive ones
(Murdock, 1999; Suldo et al., 2009). Teachers’ emotionally
obstructive behaviors, while presumably infrequent, may be far more
impactful than their positive ones. We emphasize again here that
teachers’ emotionally obstructive behaviors need not necessarily
be intentionally destructive. For example, teasing and sarcasm are
styles of communication that may be well intentioned but
potentially obstructive.

Challenge,  Support,  and  Obstruction:  Considering 
Independent  and  Interactive  Effects 

The research reviewed above suggests that there are positive
relationships between perceived challenge and student
engagement on one hand, and teacher support and student
engagement on the other. There is a dearth of research, however,
examining  both  perceived  challenge  and  teacher  support  simulta¬ 
neously,  and  there  is  even  less  research  on  teacher  obstruction. 
Consequently,  it  has  not  been  possible  to  assess  whether  teachers’ 
provision  of  support  or  obstruction  has  differential  impact  on 
student  engagement  depending  on  the  level  of  challenge  students 
perceive. 

Csikszentmihalyi and colleagues have argued that challenge
may be necessary, though not sufficient, to precipitate student
engagement (Csikszentmihalyi, 1990; Moneta &
Csikszentmihalyi, 1996). Thus, it may be important to consider how
both contextual factors like teacher support and obstruction, and
individual factors like gender and age, may enhance or inhibit the
relationship between challenge and engagement. The current study
considers the independent and interactive effects of perceived
challenge and teacher support/obstruction on student engagement,
taking into account both the instrumental and emotional
dimensions of teacher support and obstruction. The simultaneous
consideration of challenge and support or obstruction, respectively,
will provide a more nuanced understanding of the personal and
situational factors that underlie student engagement in high school
science classrooms.

Hypotheses 

Drawing  from  the  literature  just  reviewed,  we  developed  several 
specific  hypotheses  to  test  in  our  analysis. 

Hypothesis 1: Perceived challenge, teacher instrumental
support, and teacher emotional support will each be positively
associated with student engagement in science.

Hypothesis 2: The positive relationship between challenge and
engagement will be stronger in the presence of teacher
instrumental and emotional support.

Hypothesis 3: The positive relationship between challenge and
engagement will be less strong for females relative to males.

Hypothesis 4: The association between challenge and
engagement will be stronger in higher grades.

Hypothesis 5: Teacher instrumental obstruction and teacher
emotional obstruction will each be negatively associated with
student engagement in science.

Hypothesis 6: The positive relationship between challenge and
engagement will be weaker in the presence of teacher
instrumental and emotional obstruction.

Hypotheses  1  and  5  test  for  relationships  between  challenge, 
support,  obstruction,  and  engagement.  As  we  mentioned  in  our 
review  of  the  literature,  many  of  these  relationships  have  already 
been  established  in  prior  research,  but  the  vast  majority  of  this 
work  has  focused  on  elementary  school  populations.  Thus,  the 
value  of  examining  these  hypotheses  here  is  to  explore  whether 
prior  findings  involving  teacher  support  at  the  elementary  level  are 
replicated in secondary science classrooms. Beyond this, the
substantively unique contribution of this research involves the
exploration of the situational (e.g., teacher support and obstruction) and
personal  factors  (e.g.,  gender  and  grade  level)  that  moderate  the 
relationship  between  challenge  and  engagement  (Hypotheses  2,  3, 
4,  and  6). 
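
The moderation hypotheses can be written schematically as an interaction model. The expression below is a simplified, single-level sketch for orientation only: the symbols are illustrative assumptions rather than notation from the study, and the random effects for students and instructional episodes that the cross-classified analyses include are omitted.

```latex
\[
\text{Engagement} = \beta_0 + \beta_1\,\text{Challenge} + \beta_2\,M
  + \beta_3\,(\text{Challenge} \times M) + \varepsilon
\]
```

Here $M$ stands for one hypothesized moderator at a time. Under this sketch, Hypothesis 2 corresponds to $\beta_3 > 0$ when $M$ indicates observed teacher support, Hypothesis 6 to $\beta_3 < 0$ when $M$ indicates observed obstruction, Hypothesis 3 to $\beta_3 < 0$ when $M$ indicates female students, and Hypothesis 4 to a larger challenge slope in higher grades.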

Method 

Data  Source  and  Participants 

Data  used  in  this  analysis  were  collected  in  2008-2009  in  a 
single  comprehensive  high  school  serving  students  from  a  diverse 
community  located  on  the  fringe  of  a  large  metropolitan  area 
(Schmidt  &  Smith,  2008).  The  school  serves  9th  through  12th 
graders,  with  an  enrollment  of  approximately  3,300  in  2009.  For 
the  current  study,  we  used  data  from  student  surveys,  classroom 
video  footage  and  repeated  student  self-reports  in  high  school 
science classrooms. There were 223 students (54% male) within 11
classrooms: 3 classrooms each in the areas of biology, chemistry,
and physics, and 2 classrooms in 9th-grade general science.1 Forty
percent of the students were White, 40.4% Hispanic, 11.8% Black,
5% multiracial, 2.3% Asian or Pacific Islander, and less than 1%
Native American.

All 11 classroom teachers (5 male and 6 female) were White.
Demographic  characteristics  of  students  and  teachers  are  displayed 
in  Table  1. 

Procedures  and  Measures 

Data were collected using multiple methods over two time
periods (“waves”) during the academic year: 5 consecutive days
in  fall  and  5  days  in  spring.  Data  from  different  sections  of  the 
same  course  were  collected  during  the  same  time  period.  In  this 
way,  the  data  collected  from  all  sections  of  a  particular  course 
represent  the  same  point  in  the  science  curriculum  for  that  course, 
thus  enabling  analysis  of  the  role  of  the  teacher  instructional 
behaviors  while  controlling  for  the  effects  of  particular  content 
units.  Additionally,  studying  two  different  content  units  (one  in  fall 
and  one  in  spring)  within  each  subject  area  reduces  the  possibility 
that  findings  were  idiosyncratic  and  entirely  attributable  to  the 
specific  unit  examined. 

Experience  Sampling  Method 

During both waves of data collection, participants also
completed the Experience Sampling Method (ESM; Csikszentmihalyi
& Larson, 1987), a signal-contingent method of data collection in
which participants reported on subjective dimensions of
engagement in response to signals from a vibrating pager worn during
science class (see Hektner, Schmidt, & Csikszentmihalyi, 2007, for
a review). To minimize the disruption to class flow and maximize
the variety of classroom activities recorded, each day the
participants in each classroom were randomly assigned to one of two
signaling groups, with each group following a different, randomly
generated signaling schedule. Two sets of two signals were
transmitted, meaning that individual students received only two ESM
signals per lesson but ESM data were collected on four occasions
per lesson. The random signaling schedules were generated using
a computer program following procedures outlined in Hektner,
Schmidt, and Csikszentmihalyi (2007). The two-group signaling
design was necessary to compute accurate estimates of the
variance in engagement that is related to particular instructional
episodes. Following each signal, students completed an Experience
Sampling Form (ESF) in which they briefly recorded their
activities and thoughts at the time of the signal, as well as various
dimensions of their subjective experience. The ESF took
approximately 1-2 min to complete. By the study’s completion, each
participant had been signaled twice per lesson, for 10 lessons,
resulting in up to 20 reports of their experiences. A total of 3,803
ESM responses were generated during the 10 days of data
collection, representing an 85% response rate to the ESM. Missing ESM
data were almost entirely attributable to school absence. A
debriefing questionnaire administered to all student participants indicated


1 The larger study included three classrooms in 9th-grade general
science. Because of staffing changes, the original teacher in one of the general
science classrooms was given a new assignment and was replaced. Because
of this change, data from this classroom were not used in the current study.
Table 1

Student participants (N = 220)                       % Students

Sex
    Male                                             54
    Female                                           46
Race/ethnicity
    Hispanic                                         40.4
    White                                            40.0
    Black                                            11.8
    Multi-racial                                     5.0
    Asian/Pacific Islander                           2.3
    American Indian                                  .5
Subject
    General science                                  13
    Biology                                          34
    Chemistry                                        28
    Physics                                          25
Grade level
    9th                                              37
    10th                                             23
    11th                                             37
    12th                                             3
Free/reduced lunch                                   41
Parent education
    High school or less                              39
    Some college                                     10
    Graduated from college                           24
    Advanced degree                                  13
    Do not know                                      14

Teacher participants (N = 11)                        Number of teachers

Sex
    Male                                             5
    Female                                           6
Race
    White                                            11
National board certification                         3
Education level completed
    Four year college degree                         4
    Master’s degree                                  6
    PhD, or other advanced degree                    1

Teacher participants (N = 11)                        Number of years

Mean age (range 24-52)                               35.9
Mean years of teaching experience (range 2-19)       8.3
    Less experienced teachers (n = 6; range 2-7)     4.3
    More experienced teachers (n = 5; range 8-19)    13

that  the  data  collection  method  was  not  perceived  as  particularly 
disruptive:  on  a  scale  of  1-4,  with  1  =  not  at  all  disruptive  and 
4 = very disruptive, the mean rating was just under 2 (only a little
disruptive).
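
The two-group signaling design and the reported response rate can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the program used in the study: the function names, the 50-min lesson length (taken from the video-coding description), the earliest signal time, and the minimum gap between signals are all hypothetical choices.

```python
import random

def make_daily_schedule(student_ids, lesson_minutes=50, signals_per_group=2,
                        earliest=10, min_gap=5, seed=None):
    """Randomly split a class into two signaling groups and draw signal times.

    All defaults are illustrative assumptions, not values from the study.
    """
    rng = random.Random(seed)
    ids = list(student_ids)
    rng.shuffle(ids)
    half = len(ids) // 2
    groups = {"A": ids[:half], "B": ids[half:]}

    # Draw 2 groups x 2 signals = 4 distinct signal minutes per lesson,
    # redrawing until neighbouring signals are at least min_gap minutes apart.
    n_signals = 2 * signals_per_group
    while True:
        times = sorted(rng.sample(range(earliest, lesson_minutes), n_signals))
        if all(b - a >= min_gap for a, b in zip(times, times[1:])):
            break

    # Hand the four times out at random, two per group.
    rng.shuffle(times)
    schedule = {"A": sorted(times[:signals_per_group]),
                "B": sorted(times[signals_per_group:])}
    return groups, schedule

def response_rate(n_responses, n_students, signals_per_student):
    """Share of possible ESM reports that were actually completed."""
    return n_responses / (n_students * signals_per_student)

# Figures reported in the text: 3,803 responses, 223 students, up to 20 signals each.
rate = response_rate(3803, 223, 20)
print(f"{rate:.1%}")
```

With the figures given in the text, 223 students signaled up to 20 times each yields 4,460 possible reports, so 3,803 completed reports is consistent with the stated 85% response rate.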

Academic engagement (outcome measure). At each ESM
signal, students rated their concentration (i.e., As you were signaled,
how well were you concentrating?), effort (i.e., How hard were
you working?), interest (i.e., Was this activity interesting?), and
enjoyment (i.e., Did you enjoy what you were doing?) in science
on a 4-point scale, from 0 (not at all) to 3 (very much). Four-point
scales such as this one are recommended for providing a
reasonable spread of responses while minimizing measurement error that
can result from the use of more response categories (McCoach,
Gable, & Madura, 2013). The mean of responses to these four
questions was taken to form a measure of engagement. Internal
consistency of this measure, as indicated by Cronbach’s α, was .79.
The measure assesses both cognitive and affective dimensions of
engagement, dimensions that have been widely studied in the
literature (see Fredricks, Blumenfeld, & Paris, 2004; Conner &
Pope, 2013 for reviews), and is consistent with measures used by
other researchers (see Shernoff et al., 2003; Shernoff &
Csikszentmihalyi, 2009). As a signal-level indicator, the engagement
measure is variable across both persons and instructional episodes (see
description of instructional episodes).
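
The construction of the engagement composite and its internal consistency can be illustrated as follows. This is a sketch under assumptions: the item keys and the sample responses are hypothetical, and the alpha computation treats every ESM response as an independent row, which ignores the nesting of responses within students.

```python
from statistics import mean, variance

# Hypothetical keys for the four ESM items, each rated 0 (not at all) to 3 (very much).
ITEMS = ("concentration", "effort", "interest", "enjoyment")

def engagement_score(response):
    """Composite engagement: the mean of the four item ratings for one ESM response."""
    return mean(response[item] for item in ITEMS)

def cronbach_alpha(responses):
    """Cronbach's alpha: k/(k-1) * (1 - sum of item variances / variance of the total)."""
    k = len(ITEMS)
    item_vars = [variance([r[item] for r in responses]) for item in ITEMS]
    total_var = variance([sum(r[item] for item in ITEMS) for r in responses])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)

# Made-up responses, for illustration only.
sample = [
    {"concentration": 3, "effort": 2, "interest": 1, "enjoyment": 2},
    {"concentration": 1, "effort": 1, "interest": 0, "enjoyment": 1},
    {"concentration": 2, "effort": 3, "interest": 2, "enjoyment": 2},
    {"concentration": 0, "effort": 1, "interest": 1, "enjoyment": 0},
]
scores = [engagement_score(r) for r in sample]
alpha = cronbach_alpha(sample)
```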

Perceived  challenge.  At  each  ESM  signal  students  indicated 
how  challenged  they  felt  on  a  4-point  scale,  from  0  (not  at  all)  to 
3  (very  much).  As  a  signal-level  indicator,  the  challenge  rating  is 
variable  across  both  persons  and  instructional  episode. 


Video  Coding  of  Instructional  Episodes 

During  each  day  of  ESM  signaling,  a  trained  videographer 
recorded  classroom  activities  for  each  50-min  science  lesson,  with 
a  focus  on  teacher  behaviors.  Videos  were  marked  to  indicate 
when  students  responded  to  the  pager  signals,  so  that  video  data  on 
teacher  behavior  could  be  matched  to  students’  subjective  ESM 
ratings  of  their  classroom  experiences.  This  unique  methodology 
enables  researchers  to  examine  the  ways  that  students’  momentary 
subjective  experience  in  classrooms  is  related  to  specific  teacher 
practice.  This  information  is  valuable  to  both  practitioners  and 
researchers  in  that  it  can  be  used  to  identify  specific  classroom 
conditions  that  foster  student  engagement,  isolating  independent 
and  interactive  effects. 

In  total,  73.33  hr  of  video  were  collected  across  the  11  teachers. 
Teachers’  video-recorded  interactions  with  their  students  were 
then  coded  in  terms  of  the  instrumental  and  emotional  dimensions 
of  teacher  support  and  obstruction.  Following  an  extensive  review 
of  existing  procedures  for  observing  and  coding  teacher  behaviors, 
a  coding  scheme  was  developed  based  on  Turner  and  Patrick’s 
(2004)  work  on  supportive  and  nonsupportive  teacher  behaviors. 
Consistent with other widely used observation protocols like
Pianta and colleagues’ classroom assessment scoring system
(CLASS) instrument (Allen, Gregory, Mikami, Lun, Hamre, &
Pianta, 2013; La Paro & Pianta, 2003), the current coding scheme
makes conceptual and operational distinctions between
instructional and emotional discourse patterns, and identifies similar
teacher behaviors as being supportive (or nonsupportive). Our
coding procedures diverge from Pianta and colleagues’ in one
important respect, however. In the CLASS instrument, the
operational definition of instructional (or, to use our term, instrumental)
support includes challenge as a dimension of support. In our
framework,  challenge  is  conceptualized  as  a  subjective  perception 
that  is  being  considered  independently  of  teachers’  instrumentally 
supportive  behaviors.  This  conceptual  and  operational  distinction 
allows  for  individual  variation  in  the  measurement  of  challenge, 
and  enables  us  to  test  the  hypothesis  that  perceived  challenge  may 
have  a  different  impact  on  engagement  depending  on  teachers' 
affordances  of  emotional  or  instrumental  support  or  obstruction. 

In  these  video  data,  approximately  60%  of  teachers’  interactions 
with  their  students  were  directed  to  the  entire  class.  In  addition, 
teacher  comments  directed  toward  individuals  or  small  groups  of 
students  were  generally  not  made  privately,  and  thus,  could  be 
observed  by,  and  potentially  impact  all  students  in  the  class,  even 
though  the  intended  recipient  of  the  remark  was  a  specific  student 
(Schmidt, Zaleski, Shumow, Ochoa-Angrino, & Hamidova, 2011).
The  most  common  example  of  this  occurred  during  whole  class 
discussions,  where  teachers  would  direct  comments  that  were 
supportive  or  obstructive  to  specific  students,  but  these  comments 
were  made  in  front  of  the  entire  class.  Because  the  teachers’ 
interactions  with  individual  students  were  observable  by  others,  we 
chose  to  code  all  student-teacher  interactions  in  terms  of  support 
and obstruction, reasoning that any supportive or obstructive
interaction, even one with an individual student, could potentially
impact the engagement of all students in the class.

Because of the highly interactive nature of teachers’
instructional behaviors, we included teachers’ gesture and tone in the
coding scheme. This was deemed especially important in the
coding of teacher enthusiasm or humor (aspects of emotional
support) and teacher sarcasm (one aspect of emotional
obstruction).

Instructional episodes. For all lessons, the 10-min segments
of video before each ESM signal were coded for teachers’
affordances of instrumental and emotional support and obstruction
using the NVivo software package. These 10-min segments are
hereafter referred to as “instructional episodes” (IEs). A total of
440 IEs were identified and coded (4 episodes per class period ×
10 class periods × 11 classrooms). Within each IE, teacher
instructional behaviors were coded for the presence or absence of
each type of support and obstruction. Thus, the support and
obstruction indicators were constructed to be episodic indicators
rather than global teacher characteristics, meaning that the same
teacher could be coded as supportive in one instructional episode
and not supportive in another. Because of methodological and
conceptual difficulties in quantifying the magnitude of support or
obstruction evidenced in each IE, we did not attempt to measure
the degree of support or obstruction within IEs, but rather assigned
dummy codes to indicate the presence of observable instrumental
support, emotional support, instrumental obstruction, and
emotional obstruction within each IE. Percent agreement between two
independent coders (conducted on 10% of the videos) was .84, .82,
.93, and .95 for instrumental support, emotional support,
instrumental obstruction, and emotional obstruction, respectively. Given
that each IE was coded for presence/absence of the support and
obstruction constructs, percent interrater agreement is an
appropriate indicator of reliability (Tinsley & Weiss, 1975, 2000).
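
The episode count and the agreement statistic described above can be expressed compactly. The sketch below is illustrative only: the function name and the example code vectors are hypothetical, and it assumes each coder's output for one construct is a list of 0/1 presence codes aligned by instructional episode.

```python
def percent_agreement(coder_a, coder_b):
    """Proportion of episodes on which two coders assigned the same 0/1 code."""
    if len(coder_a) != len(coder_b) or not coder_a:
        raise ValueError("coders must rate the same, non-empty set of episodes")
    matches = sum(a == b for a, b in zip(coder_a, coder_b))
    return matches / len(coder_a)

# Number of instructional episodes implied by the design stated in the text.
episodes_per_period, class_periods, classrooms = 4, 10, 11
total_ies = episodes_per_period * class_periods * classrooms

# Made-up codes for one construct (1 = behavior observed in the episode).
coder_a = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]
coder_b = [1, 0, 1, 0, 0, 1, 1, 1, 0, 1]
agreement = percent_agreement(coder_a, coder_b)
```

Percent agreement does not correct for chance agreement; whether a chance-corrected index would be preferable is a separate question that this sketch does not address.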

Teacher  instrumental  support.  The  presence  of  instrumental 
support  was  coded  when,  at  any  point  within  a  given  IE,  the 


teacher  was  observed  scaffolding  students’  efforts  in  science- 
related  endeavors  in  a  way  that  helped  students  successfully 
progress with the task at hand (e.g., provided materials and
feedback, and helped the students understand the assignment
through structured questions). An example from the video data
is when a teacher helped students understand the content by
physically demonstrating the concept in class or by drawing it
on the board. At the end of the demonstration the teacher asked,
“Does that help?”

Teacher emotional support. The presence of emotional
support was coded when, at any point within a given IE, the teacher
encouraged, respected, or cared for students, or conveyed confidence
and excitement in students’ abilities to fulfill classroom requirements
successfully. Sample comments from the video data include “I’m
basking in the glow of your brilliance,” “Awesome, you guys are
awesome,” and “Yes, Excellent. That’s a great idea.”

Teacher  instrumental  obstruction.  Occasions  when  the 
teacher  explicitly  undermined  the  students’  efforts  or  failed  to 
respond  to  obvious  student  bids  for  assistance  were  coded  as 
instrumental  obstruction.  For  example,  we  observed  multiple 
versions  of  a  scenario  in  which  a  student  would  raise  his  or  her 
hand  in  obvious  need  of  clarification  or  guidance,  and  the 
teacher  would  fail  to  notice  or  acknowledge  this  student  for  a 
long  period  of  time.  The  student  would  then  appear  frustrated, 
lower  his  or  her  hand,  and  then  appear  disinterested  or  off-task. 
A  second  example  of  instrumental  obstruction  occurred  when  a 
teacher  had  not  tested  materials  ahead  of  time  for  a  lesson  about 
chemical  reactions,  with  the  result  that  students  who  followed 
the  procedures  accurately  did  not  achieve  the  intended  result 
(i.e.,  the  reactants  did  not  react).  Another  example  is  when  a 
teacher  made  an  error  in  printing  copies  of  a  test  with  the  result 
that  test  items  did  not  easily  correspond  to  the  item  numbers  on 
the Scantron sheets students had to use to record their
responses. The instrumental obstruction indicator captured a
variety of situations in which teachers’ poor planning or
negligence created logistical disruptions to students completing their
work.  In  such  instances,  teachers  were  often  not  intending  to  be 
disruptive,  and  may  have  been  responding  in  the  best  way  they 
could  to  the  multiple  demands  of  teaching,  but  such  instances 
were  coded  as  obstructive  because  they  resulted  in  disruptions 
all  the  same. 

Teacher emotional obstruction. Teacher statements that
either attacked or neglected students emotionally through the use of
undermining or sarcastic statements were coded as emotionally
obstructive. Note that, in their use of such statements, teachers may
or may not have had malicious intent. However, given the nature
of sarcasm, we reasoned that the potential for harm exists.
Examples from the video data include an instance when a teacher said,
“You know, the sad thing is that you guys know this stuff but your
head is slow,” and a separate occasion when a teacher said, “Come
on people, you are boring the crap out of me!”

The reader should note that our coding of obstruction (both
instrumental and emotional) combines both purposeful
undermining and neglect, which may be conceptually distinct teacher
instructional behaviors. While we would have liked to examine
differential effects for these two dimensions of obstructive teacher
behavior, these behaviors occurred relatively infrequently in our
data, and therefore we were reluctant to further subdivide these
already small categories.

Student  characteristics. 

Gender.  Student  gender  was  reported  in  the  student  survey. 
Gender  is  dummy-coded  as  female  in  all  analyses  (0  =  no,  1  = 
yes). 

Grade  level.  Grade  level  was  also  reported  in  the  student 
survey  and  refers  to  students’  educational  level  (9th  through  12th 
grade).  In  analyses,  grade  level  was  dummy  coded  using  12th 
grade  as  the  reference  group. 
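
The dummy coding of the two student characteristics can be sketched as below. The function and variable names are hypothetical; the only details taken from the text are that gender is coded as female (0 = no, 1 = yes) and that 12th grade serves as the reference group.

```python
def code_student(gender, grade):
    """Return dummy codes for one student.

    female: 1 if the student is female, else 0.
    grade_9, grade_10, grade_11: indicators with 12th grade as the reference
    group, so a 12th grader has zeros on all three.
    """
    if grade not in (9, 10, 11, 12):
        raise ValueError("grade must be 9, 10, 11, or 12")
    row = {"female": 1 if gender == "female" else 0}
    for g in (9, 10, 11):
        row[f"grade_{g}"] = 1 if grade == g else 0
    return row

# Made-up examples.
ninth_grade_girl = code_student("female", 9)
twelfth_grade_boy = code_student("male", 12)
```

With this coding, each grade coefficient in a model is read as a contrast with 12th graders rather than as a linear trend across grades.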

Results 

Descriptive  Results  About  Engagement,  Challenge, 
Support,  and  Obstruction 

In our first set of analyses we use descriptive statistics to
provide accounts of the degree to which students report
engagement and challenge in their science classroom, and the degree to
which researchers observed the various dimensions of teacher
support and obstruction.

Engagement. Across all of the instructional episodes
sampled, students reported a mean engagement level of 1.56 (SD =
.76) on the 0 to 3 rating scale on the ESM, which indicates that on
average, students were “somewhat” engaged during science
instruction.

Challenge.  While  there  was  substantial  variation  in  students’ 
ESM  ratings  of  challenge,  the  mean  challenge  rating  across  all 
students  and  all  instructional  episodes  was  quite  low  (M  =  0.89, 
SD  =  0.93  on  a  0  to  3  rating  scale).  This  indicates  that  on  average, 
students  were  “a  little”  challenged  during  science  instruction. 

Support and obstruction. The frequency with which each of
the support and obstruction indicators was observed in the
instructional episodes captured in the video data is reported in Table 2. In
this table we also report the mean and SD of student-reported
engagement within each of the support and obstruction conditions.
Across the 440 IEs coded, instrumental support was observed
frequently, occurring in 84% of all IEs. Emotional support was
also observed often, though less frequently than instrumental
support, occurring in 69% of all IEs. Instrumental obstruction was
observed in 24% of all IEs and emotional obstruction was
observed in 28% of all IEs.

Table 2

Observed Frequency of Instructional Behaviors and Engagement
Across Instructional Episodes

Instructional behavior        Instructional    % of total    Engagement
                              episodes                       Mean (SD)

Instrumental support
    Not observed              68               15.5          1.47 (.73)
    Observed                  372              84.5          1.58 (.76)
Emotional support
    Not observed              137              31            1.59 (.76)
    Observed                  303              69            1.55 (.76)
Instrumental obstruction
    Not observed              334              76            1.54 (.76)
    Observed                  106              24            1.59 (.75)
Emotional obstruction
    Not observed              316              72            1.59 (.75)
    Observed                  124              28            1.49 (.77)
Total                         440              100

As the frequency counts suggest, some IEs were coded as
evidencing both instrumental and emotional support (this was the
case for about 60% of IEs, φ = .11, p < .05). For this reason, our
analyses of the relationships between support and engagement
included both the instrumental and emotional support indicators to
test for independent effects. Likewise, instrumental and emotional
obstruction were included in the same models to test for
independent associations because they also co-occurred (in about 10% of
IEs, φ = .14, p < .01). Although it was theoretically possible for
IEs to be coded as having both supportive and obstructive
instructional behaviors, the actual occurrence of this in the data was rare
(about 6% of IEs), preventing the meaningful inclusion of
obstruction and support in a single model. Consequently, analyses are
presented separately for teacher support and teacher obstruction.
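
The association statistics reported in this paragraph appear to be phi coefficients for pairs of dummy-coded indicators. A minimal sketch of that computation follows. The marginal counts come from Table 2 (372 and 303 of 440 IEs), but the joint count of 264 is an assumption obtained by reading "about 60%" as exactly 60% of 440, so the result is only an approximate check.

```python
from math import sqrt

def phi_coefficient(n11, n10, n01, n00):
    """Phi coefficient for a 2x2 table of two binary indicators.

    n11: both observed, n10: only the first, n01: only the second, n00: neither.
    """
    numerator = n11 * n00 - n10 * n01
    denominator = sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    return numerator / denominator

total = 440
instrumental = 372          # IEs with instrumental support (Table 2)
emotional = 303             # IEs with emotional support (Table 2)
both = 264                  # ASSUMED: exactly 60% of 440 IEs

only_instrumental = instrumental - both
only_emotional = emotional - both
neither = total - both - only_instrumental - only_emotional

phi = phi_coefficient(both, only_instrumental, only_emotional, neither)
```

Under that assumed joint count, the coefficient comes out close to the reported value of .11, which suggests the garbled symbols in the extracted text stand for φ.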

The descriptive statistics on student engagement indicate that,
numerically speaking, the highest levels of mean student
engagement occurred in instructional episodes when the teacher was
observed providing instrumental support (M = 1.59), while the
lowest levels occurred in instructional episodes when teacher
instrumental support was not observed (M = 1.47). The degree of
variation in engagement was comparable across all types of
support and obstruction (.73-.77). Statistical comparisons of student
engagement levels associated with each of these support and
obstruction conditions are reported in the cross-classified analyses.

Relations  of  Challenge  and  Support 
to  Student  Engagement 

To  address  our  questions  about  the  relationships  between  chal¬ 
lenge,  support,  obstruction,  and  engagement,  cross-classified  mod¬ 
els  (CCMs)  were  used  (Raudenbush  &  Bryk,  2002).  The  repeated 
measures  data  produced  by  the  ESM  are  hierarchical  in  two 
different  respects:  individual  ESM  responses  are  nested  within 
persons,  but  are  also  nested  within  particular  instructional  epi¬ 
sodes.  Because  instructional  episodes  are  not  purely  nested  within 
persons,  the  resultant  hierarchical  structure  is  cross-classified. 
Cross-classified  models  take  into  account  the  multiple  nesting 
structures  in  the  partitioning  of  variance  and  estimation  of  fixed 
and  random  effects  (Raudenbush  &  Bryk,  2002).  Initial,  fully 
unconditional  models  predicting  engagement  indicated  that  there 
was  sufficient  variance  within  each  of  the  nesting  structures  to 
warrant  this  particular  type  of  analysis.  Results  from  a  baseline 
null  cross-classified  model  indicated  that  10%  of  the  variance  in 
engagement  occurred  between  cell  (cross-classification  of  student 
and  instructional  episode),  about  37%  occurred  between-students, 
and  about  47%  was  attributed  to  instructional  episode.  We  wish  to 
acknowledge  here  that  both  the  students  and  the  instructional 
episodes  are  also  nested  within  classrooms;  however,  classroom  is 
not  modeled  as  a  level  in  our  analyses.  This  modeling  structure 
was  necessary  because  of  software  restrictions  about  the  number  of 
levels  that  can  be  included  in  cross-classified  models.  We  opted  for 
the  cross-classified  model  rather  than  the  more  pure  three-level 
structure  of  responses  nested  within  students  nested  within  class¬ 
rooms  because  our  primary  interest  was  in  the  characteristics  of  the 
instructional  episode  (in  terms  of  teachers’  momentary  affordances 
of  support  and  obstruction).  Additionally,  preliminary  analyses 
(not presented here) comparing the two competing modeling structures
indicated that the classroom level accounted for only 1% of
the  observed  variance  in  student  engagement,  whereas  instruc¬ 
tional  episode  accounted  for  10%.  Thus,  the  decision  to  use  a 
cross-classified  model  to  the  exclusion  of  the  classroom  level  was 
made  on  both  conceptual  and  methodological  grounds. 
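
For reference, the fully unconditional cross-classified model described above can be written in the notation of Raudenbush and Bryk (2002). This is a sketch of the standard form, not a reproduction of the fitted specification: it assumes that i indexes ESM responses, j instructional episodes, and k students, and it borrows the residual labels used in Tables 3 and 4. The symbols \tau_{b00} and \tau_{c00} for the between-episode and between-student variances are assumed notation.

```latex
Y_{ijk} = \theta_0 + b_{00j} + c_{00k} + e_{ijk}, \qquad
e_{ijk} \sim N(0, \sigma^2), \quad
b_{00j} \sim N(0, \tau_{b00}), \quad
c_{00k} \sim N(0, \tau_{c00})

\rho_{\mathrm{episode}} = \frac{\tau_{b00}}{\tau_{b00} + \tau_{c00} + \sigma^2}, \qquad
\rho_{\mathrm{student}} = \frac{\tau_{c00}}{\tau_{b00} + \tau_{c00} + \sigma^2}
```

Under this form, the share of variance attributed to each nesting structure is the ratio of its variance component to the total.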

The  (momentary)  outcome  in  all  of  our  analytical  models  is 
student  engagement,  which  is  predicted  by  a  single  within-student 
by  episode  variable,  students’  perception  of  momentary  challenge. 
Both  of  these  constructs  were  measured  via  ESM.  Predictors 
related  to  the  instructional  episode  include  the  dummy-coded 
indicators  of  teacher  instrumental  support,  teacher  emotional  sup¬ 
port,  teacher  instrumental  obstruction,  and  teacher  emotional  ob¬ 
struction,  which  were  ascertained  through  the  video  data  corre¬ 
sponding  to  students’  ESM  responses.  Predictors  related  to 
personal  characteristics  came  from  survey  data  and  included  gen¬ 
der  and  grade  in  school.  In  preliminary  analysis  not  presented  here, 
we  tested  whether  the  relationships  of  interest  varied  by  course 
subject  (general  science,  biology,  chemistry,  or  physics)  or  semes¬ 
ter  (fall,  spring),  and  did  not  find  significant  variation  in  the 
relationships  of  interest  by  these  factors.  For  this  reason,  course 
subject  and  semester  were  not  included  in  the  final  models. 

A  series  of  three  cross-classified  HLM  models  (CCMs)  was  run 
to  test  for  independent  and  interactive  associations  of  instrumental 
support,  emotional  support,  and  challenge  on  engagement,  while 
taking  into  consideration  the  personal  characteristics  of  grade 
and  gender.  Model  1  resulted  after  we  tested  for  the  main  fixed 
and  random  effects  of  perceived  challenge,  instrumental  sup¬ 
port,  and  emotional  support  on  students’  level  of  momentary 
engagement.  A  statistically  significant  random  effect  for  emo¬ 
tional  support  was  found,  and  was  retained  in  the  remaining 
models.  The  possibility  of  multicollinearity  resulting  from  the 
inclusion  of  both  instrumental  and  emotional  components  of 
support  or  obstruction  was  assessed  by  examining  estimates  and 
their  standard  errors  in  Model  1  and  in  a  reduced  model  without 
the  emotional  component;  small  differences  between  the  mod¬ 
els  suggest  that  multicollinearity  is  not  a  concern.  In  Model  2, 
instrumental  and  emotional  support  were  added  as  moderators 
of  the  association  between  challenge  and  engagement,  which 
allows  us  to  explore  how  the  relationship  between  challenge 
and student engagement changes as a function of teacher support
(the interpretation of the coefficients γ₁₁ and γ₁₂) while
controlling for the main effects of challenge, instrumental and
emotional support on student engagement (the interpretation of
the coefficients θ₁, γ₀₁, and γ₀₂, respectively). Thus, Models 1
and  2  focus  on  predictors  of  engagement  that  are  related  to 
characteristics  of  the  instructional  episode  (i.e.,  teacher  sup¬ 
port).  Model  3  incorporates  person-level  characteristics  as  pre¬ 
dictors,  testing  for  both  main  effects  of  gender  and  grade  on 
engagement,  as  well  as  moderating  effects  on  the  relationship 
between  challenge  and  engagement.  This  same  series  of  three 
models  was  then  replicated  using  instrumental  and  emotional 
obstruction  as  predictors  of  momentary  engagement. 
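
As a reading aid, Model 2 for the support analyses can be sketched in equation form. The layout below is an assumption based on the coefficient labels in Table 3 and the conventions of Raudenbush and Bryk (2002); the variable names Challenge, Instr, and Emot are illustrative shorthand for perceived challenge and the two dummy-coded support indicators, and the random slope c_{02k} reflects the retained random effect for emotional support.

```latex
% Level 1: response i, within instructional episode j and student k
Y_{ijk} = \pi_{0jk} + \pi_{1jk}\,\mathrm{Challenge}_{ijk} + e_{ijk}

% Intercept: support main effects plus episode and student random effects
\pi_{0jk} = \theta_0 + \gamma_{01}\,\mathrm{Instr}_{j}
          + (\gamma_{02} + c_{02k})\,\mathrm{Emot}_{j} + b_{00j} + c_{00k}

% Challenge slope: moderated by the two support indicators
\pi_{1jk} = \theta_1 + \gamma_{11}\,\mathrm{Instr}_{j} + \gamma_{12}\,\mathrm{Emot}_{j}
```

Model 1 corresponds to dropping the \gamma_{11} and \gamma_{12} terms, and Model 3 adds the gender and grade indicators (the \beta coefficients) to both the intercept and the challenge slope.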

In all models, all predictor variables were left uncentered. Thus,
in Models 1 and 2, the intercept θ₀ indicates the mean level of
student engagement when perceived challenge = 0 (i.e., when
students indicated their science work was “not at all challenging”)
and no teacher emotional or instrumental support (or obstruction)
is provided. The coefficient θ₁ denotes the slope of challenge and
indicates how the mean level of engagement changes as levels of
challenge increase (controlling for teacher support/obstruction).
When  the  person-level  variables  are  added  in  Model  3,  the  inter¬ 
cept  is  the  mean  level  of  student  engagement  when  12th-grade 
male  students  perceive  their  work  as  not  challenging  and  no 
teacher  support  (or  obstruction)  is  offered.  Adding  the  grade  and 
gender  variables  to  the  slope  of  challenge  results  in  estimates  that 
show  how  the  relationship  between  challenge  and  engagement 
changes  by  grade  and  gender  of  students  and  when  teacher  support 
(or  obstruction)  is  provided.  For  statistically  significant  coeffi¬ 
cients,  effect  sizes  (approximating  a  standardized  coefficient  for 
cross-classified  models)  were  computed  as  suggested  by  Lai  and 
Kwok  (2014).  Because  these  effect  sizes  are  not  conventional 
standardized  coefficients,  they  are  referred  to  by  the  abbreviation 
ES rather than β. Given that they approximate standardized coefficients,
rules of thumb for the practical meaning of their magnitude
can be applied.

Teacher  support  and  perceived  challenge.  Table  3  presents 
the  results  of  a  series  of  cross-classified  HLM  models  (CCMs) 
testing  the  relationship  between  perceived  challenge,  teachers’ 
provision  of  support,  and  student  engagement.  Effect  sizes  of  all 
statistically  significant  coefficients  can  be  characterized  as  small. 
Model  1,  which  was  designed  to  address  Hypothesis  1,  indicates 
that  students’  perceived  challenge  was  positively  and  significantly 
related to their engagement in science learning activities (θ₁ = .09,
p < .001, ES = .12). Even when students perceived challenge to
be low, the presence of teacher instrumental support yielded an
increase in student engagement (γ₀₁ = .09, p < .05, ES = .12).
While teacher provision of emotional support was not consistently
related to student engagement (γ₀₂ = -.03, ns), the random effect
for  this  variable  suggests  that  there  is  heterogeneity  across  students 
in  this  relationship.  This  finding  persisted  across  Models  2  and  3. 
These  results  provide  partial  support  for  Hypothesis  1  in  that 
perceived  challenge  and  teacher  instrumental  support  were  posi¬ 
tively  associated  with  student  engagement,  while  teacher  emo¬ 
tional  support  was  not. 

Model  2  was  designed  to  address  Hypothesis  2,  and  explores 
whether  the  positive  relationship  between  challenge  and  engage¬ 
ment  is  stronger  in  the  presence  of  instrumental  or  emotional 
support. The nonsignificant coefficient for the instrumental support
indicator used to predict the challenge slope (γ₁₁ = -.02, ns)
indicates that the positive association between instrumental support
and engagement remains constant as students perceive activities
to  be  more  challenging.  In  other  words,  teachers’  provision  of 
instrumental  support  is  associated  with  increases  in  student  en¬ 
gagement,  regardless  of  what  level  of  challenge  students  perceive. 
In  this  second  model,  teachers’  provision  of  emotional  support  has 
no  measurable  relationship  to  student  engagement,  either  as  a  main 
effect  or  through  a  relationship  with  challenge;  although  the  sta¬ 
tistically  significant  random  effect  suggests  that  the  relationship 
between  emotional  support  and  engagement  functions  differen¬ 
tially  across  students.  Teacher  instrumental  support  continues  to 
have  a  positive  relationship  with  student  engagement  as  a  main 
effect (γ₀₁ = .10, p < .05, ES = .13), and does not moderate the
relationship  between  challenge  and  engagement.  These  findings  do 
not  provide  support  for  Hypothesis  2  in  that  the  relationship 
between  challenge  and  engagement  is  not  substantially  stronger 
when  teachers  offer  instrumental  or  emotional  support. 
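
To make the uncentered coding concrete, the sketch below plugs the Model 2 fixed effects reported in Table 3 into the prediction equation. It is an illustration only: the function and variable names are hypothetical, all random effects are set to zero, the support indicators are assumed to be coded 0/1, and the challenge values in the loop are chosen arbitrarily for display.

```python
# Fixed effects reported for Model 2 in Table 3 (support models).
THETA_0 = 1.44    # intercept: challenge = 0, no observed support
GAMMA_01 = 0.10   # instrumental support, main effect
GAMMA_02 = -0.05  # emotional support, main effect (ns)
THETA_1 = 0.09    # challenge slope when no support is observed
GAMMA_11 = -0.02  # instrumental support x challenge (ns)
GAMMA_12 = 0.02   # emotional support x challenge (ns)


def challenge_slope(instrumental: int, emotional: int) -> float:
    """Slope of challenge for an episode with the given 0/1 support indicators."""
    return THETA_1 + GAMMA_11 * instrumental + GAMMA_12 * emotional


def predicted_engagement(challenge: float, instrumental: int, emotional: int) -> float:
    """Model-implied mean engagement, with all random effects set to zero."""
    intercept = THETA_0 + GAMMA_01 * instrumental + GAMMA_02 * emotional
    return intercept + challenge_slope(instrumental, emotional) * challenge


if __name__ == "__main__":
    # Compare episodes with and without instrumental support.
    for instrumental in (0, 1):
        for challenge in (0, 1, 2):
            value = predicted_engagement(challenge, instrumental, emotional=0)
            print(f"instrumental={instrumental} challenge={challenge} -> {value:.2f}")
```

With these estimates, the gap between episodes with and without instrumental support changes very little as challenge rises, which is what the nonsignificant interaction coefficient implies.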

Table 3
Associations of Teachers’ Momentary Provision of Instrumental and Emotional Support With
Students’ Engagement (CCREM)

                                                  Model 1          Model 2          Model 3
Fixed effects                                     B         SE     B         SE     B         SE
For intercept 1, π₀
  Intercept, θ₀                                   1.43***   .05    1.44***   .06    1.45***   .08
  Instrumental support, γ₀₁                       .09*      .04    .10*      .05    .09*      .05
  Emotional support, γ₀₂                          -.03      .04    -.05      .04    -.05      .04
  Female, β₀₁                                                                       .02       .07
  9th grade, β₀₂                                                                    -.02      .08
  10th grade, β₀₃                                                                   -.06      .09
  11th grade, β₀₄                                                                   -.09      .21
For challenge, π₁
  Challenge slope, θ₁                             .09***    .01    .09**     .03    .09*      .04
  Instrumental support, γ₁₁                                        -.02      .03    -.01      .03
  Emotional support, γ₁₂                                           .02       .03    .02       .03
  Female, β₁₁                                                                       -.05*     .02
  9th grade, β₁₂                                                                    .02       .03
  10th grade, β₁₃                                                                   .03       .03
  11th grade, β₁₄                                                                   .15*      .07

Random effects                                    σ²               σ²               σ²
  Within students by IE, e_ijk                    .29372           .29362           .29257
  Between IEs, b_00j                              .05550***        .05552***        .05602***
  Between students, c_00k                         .20649***        .20797***        .20798***
  Between students, emotional support, c_02k      .02989**         .02994**         .02939**

Note. HLM = hierarchical linear modeling; CCREM = cross-classified random-effects modeling; IE =
instructional episode; variance components tested using a χ² test.
* p < .05. ** p < .01. *** p < .001.

Model 3 was built to address Hypotheses 3 and 4, exploring the
role of gender and grade level in the relationship between challenge,
support, and engagement. The addition of gender and grade
to  the  model  alters  the  interpretation  of  the  coefficients  in  the 
model,  such  that  the  intercept  now  refers  to  mean  engagement  for 
12th  grade  males  with  no  challenge  and  no  teacher  support.  The 
nonsignificant coefficients for main effects of gender and grade
level (β₀₁, β₀₂, β₀₃, and β₀₄) indicate that males and females, as
well as students at all grade levels, experience similar levels of
engagement in science controlling for challenge and teacher support.
As in previous models, even after controlling for these
personal characteristics, teachers’ instrumental support continues
to be associated with higher levels of student engagement (γ₀₁ =
.09, p < .05, ES = .12), and the positive relationship between
challenge and engagement remains constant (θ₁ = .09, p < .05,
ES = .12) regardless of whether instrumental support is provided
(γ₁₁ = -.01, ns). As in prior models, teachers’ provision of
emotional support is not predictably related to engagement at any
level of challenge (γ₀₂ = -.05 and γ₁₂ = .02, both ns), but the
random effect suggests that the relationship functions differently
across students.

As shown in Model 3 of Table 3, both gender and grade level
appear to moderate the relationship between challenge and engagement.
Whereas the relationship between challenge and engagement
is positive for males (θ₁ = .09, p < .05, ES = .12), it is diminished
by more than half for females (β₁₁ = -.05, p < .05, ES = .07).
In other words, male students experienced greater increases in
engagement when they perceived their science work as challenging
compared with female students. This result supports Hypothesis 3.
Relative to 12th graders, students in 11th grade experienced
steeper increases in engagement as their work became more challenging
(β₁₄ = .15, p < .05, ES = .20). This suggests that
challenge may have more utility for enhancing engagement for
11th graders compared with students who are in their last year of
high  school.  Post  hoc  comparisons  of  differences  in  the  relation¬ 
ship  between  engagement  and  challenge  for  other  grade  levels  did 
not  yield  statistically  significant  results.  This  finding  provides 
partial  support  for  Hypothesis  4  in  that  the  relationship  between 
challenge  and  engagement  was  more  positive  for  11th  graders 
relative  to  12th  graders,  but  did  not  differ  significantly  from  9th  or 
10th  graders. 
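
The moderation results can be read as group-specific challenge slopes. The sketch below adds the Model 3 slope coefficients from Table 3 to the reference slope for 12th-grade males; it assumes no teacher support is observed, and the function name and dictionary are hypothetical conveniences rather than part of the original analysis. Only the female and 11th-grade offsets were statistically significant.

```python
# Challenge-slope coefficients reported for Model 3 in Table 3.
THETA_1 = 0.09   # challenge slope for the reference group (12th-grade males)
BETA_11 = -0.05  # offset for female students
# Grade offsets relative to 12th grade (9th and 10th were nonsignificant).
GRADE_OFFSETS = {9: 0.02, 10: 0.03, 11: 0.15, 12: 0.0}


def challenge_slope(female: bool, grade: int) -> float:
    """Challenge slope for a student group when no teacher support is observed."""
    gender_offset = BETA_11 if female else 0.0
    return THETA_1 + gender_offset + GRADE_OFFSETS[grade]


if __name__ == "__main__":
    for grade in (11, 12):
        for female in (False, True):
            label = "female" if female else "male"
            print(f"grade {grade}, {label}: slope = {challenge_slope(female, grade):.2f}")
```

For 12th graders, the female slope is less than half the male slope, which matches the description of the gender effect above.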

Teacher  obstruction  and  perceived  challenge.  Table  4  pres¬ 
ents  a  series  of  cross-classified  HLM  models  (CCMs)  that  are 
identical  to  those  just  presented,  but  with  instrumental  and  emo¬ 
tional  obstruction  (rather  than  support)  as  predictors.  These  models 
were  designed  to  examine  the  relationship  between  teacher  ob¬ 
struction,  student  engagement,  and  student  characteristics,  and  to 
explore  whether  the  relationship  between  challenge  and  student 
engagement  varied  by  student  characteristics  or  by  the  presence  of 
teacher  obstruction.  Random  effects  for  challenge  and  teacher 
obstruction  were  examined,  and  the  statistically  significant  random 
effect  for  emotional  obstruction  was  retained  in  all  models.  Effect 
sizes  of  all  statistically  significant  coefficients  are  considered 
small. 

Consistent with our prediction stated in Hypothesis 5, results of
Model 1 indicate that when teachers are emotionally obstructive,
student engagement decreases (γ₀₂ = -.11, p < .01, ES = .14).
Surprisingly, contrary to the prediction in Hypothesis 5, when
teachers were instrumentally obstructive, student engagement increased
(γ₀₁ = .09, p < .01, ES = .12). Thus, results provide only
partial support for Hypothesis 5. Consistent with results of the
support analyses presented in Table 3, the relationship between
academic challenge and student engagement is positive in this
model as well (θ₁ = .09, p < .001, ES = .12), net of the
relationships between engagement and teacher emotional and instrumental
obstruction.

Table 4
Associations of Teachers’ Momentary Provision of Instrumental and Emotional Obstruction With
Students’ Engagement (CCREM)

                                                  Model 1          Model 2          Model 3
Fixed effects                                     B         SE     B         SE     B         SE
For intercept 1, π₀
  Intercept, θ₀                                   1.49***   .04    1.48***   .04    1.51***   .07
  Instrumental obstruction, γ₀₁                   .09**     .03    .16***    .04    .16***    .04
  Emotional obstruction, γ₀₂                      -.11**    .04    -.14**    .05    -.13**    .05
  Female, β₀₁                                                                       .00       .07
  9th grade, β₀₂                                                                    -.02      .08
  10th grade, β₀₃                                                                   -.08      .09
  11th grade, β₀₄                                                                   -.10      .21
For challenge, π₁
  Challenge slope, θ₁                             .09***    .01    .10***    .01    .10***    .02
  Instrumental obstruction, γ₁₁                                    -.07**    .03    -.07**    .03
  Emotional obstruction, γ₁₂                                       .03       .03    .02       .03
  Female, β₁₁                                                                       -.04†     .02
  9th grade, β₁₂                                                                    .02       .03
  10th grade, β₁₃                                                                   .04       .03
  11th grade, β₁₄                                                                   .15*      .07

Random effects                                    σ²               σ²               σ²
  Within students by IE, e_ijk                    .29406           .29351           .29221
  Between IEs, b_00j                              .05381***        .05363***        .05414***
  Between students, c_00k                         .19936***        .19917***        .19771***
  Between students, emotional obstruction, c_02k  .03434***        .03354***        .03517***

Note. HLM = hierarchical linear modeling; CCREM = cross-classified random-effects modeling; IE =
instructional episode; variance components tested using a χ² test.
† p < .1. * p < .05. ** p < .01. *** p < .001.

Results  from  Model  2  address  Hypothesis  6,  and  indicate  that 
the  observed  positive  relationship  between  challenge  and  students’ 
engagement is consistent whether or not teachers’ emotional obstruction
is present (γ₁₂ = .03, ns). Students’ engagement decreases
when teacher emotional obstruction is present
(γ₀₂ = -.14, p < .001, ES = .18), and the random effect suggests
heterogeneity in this relationship across students. These results
suggest that, regardless of the level of challenge students were
experiencing at the time (γ₁₂ = .03, ns), if teachers teased, harassed,
or used sarcasm toward any student, the students in the
entire class generally disengaged, and the level of this disengagement
varied across students, though this variation was not accounted
for by predictors in our model. The estimate of teachers’
instrumental obstruction on the challenge slope is negative, which
suggests that as challenge increases, the experience of teacher
instrumental obstruction tends to reduce the positive relationship
between challenge and student engagement (γ₁₁ = -.07, p < .01,
ES = .09), almost completely diminishing it. Thus, although the
mean  effect  of  teachers’  emotional  obstruction  on  student  engage¬ 
ment  is  negative  regardless  of  challenge  levels,  the  effect  of 
teachers’  instrumental  obstruction  depends  on  the  level  of  aca¬ 
demic  challenge  present  in  the  situation.  The  estimate  of  the  main 
effect of instrumental obstruction indicated increased engagement
net of emotional obstruction and challenge. In addition, instrumental
obstruction had a dampening effect on the positive relationship
between  students’  perceived  challenge  and  their  engagement.  In 
other  words,  when  students  are  not  being  challenged  and  teachers 
present  instructional  obstacles,  students  report  increased  engage¬ 
ment,  perhaps  as  a  coping  mechanism  to  deal  with  these  obstacles. 
However,  as  challenge  increases,  the  addition  of  instrumentally 
obstructive  behaviors  by  teachers  does  not  similarly  increase  stu¬ 
dent  engagement.  These  results  provide  partial  support  for  Hy¬ 
pothesis  6.  As  predicted,  the  presence  of  instrumental  obstruction 
weakened  the  positive  relationship  between  challenge  and  engage¬ 
ment.  Similar  effects  were  not  obtained  for  teacher  emotional 
obstruction:  Emotional  obstruction  was  consistently  associated 
with  decreases  in  engagement,  regardless  of  the  level  of  challenge. 
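
The conditional nature of the instrumental obstruction effect can be worked through numerically. The sketch below uses the Model 2 estimates reported in the text for the obstruction analyses; the function names are hypothetical, random effects are ignored, and the challenge values passed in are arbitrary illustrations rather than the actual response scale.

```python
# Fixed effects reported for Model 2 of the obstruction analyses (Table 4).
GAMMA_01 = 0.16   # instrumental obstruction, main effect (at challenge = 0)
THETA_1 = 0.10    # challenge slope when no obstruction is observed
GAMMA_11 = -0.07  # instrumental obstruction x challenge


def instrumental_obstruction_effect(challenge: float) -> float:
    """Difference in mean engagement associated with instrumental obstruction
    at a given level of perceived challenge."""
    return GAMMA_01 + GAMMA_11 * challenge


def challenge_slope(instrumental_obstruction: int) -> float:
    """Challenge slope with (1) or without (0) instrumental obstruction."""
    return THETA_1 + GAMMA_11 * instrumental_obstruction


if __name__ == "__main__":
    for challenge in (0, 1, 2):
        effect = instrumental_obstruction_effect(challenge)
        print(f"challenge={challenge}: obstruction effect = {effect:.2f}")
    print(f"slope without obstruction = {challenge_slope(0):.2f}")
    print(f"slope with obstruction    = {challenge_slope(1):.2f}")
```

Under these estimates the challenge slope falls from .10 to roughly .03 when instrumental obstruction is present, which is the sense in which the positive relationship is almost completely diminished.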

In  Model  3,  all  of  the  previously  observed  relationships  between 
instrumental  obstruction,  emotional  obstruction,  challenge,  and 
engagement  persisted  after  controlling  for  gender  and  grade. 
Teacher emotional obstruction continued to have a negative relationship
with student engagement, net of challenge and student
gender and grade (γ₀₂ = -.13, p < .01, ES = .17) and had a
statistically significant random effect. Teacher instrumental obstruction
was positively associated with engagement at low levels
of challenge (γ₀₁ = .16, p < .001, ES = .21), but this relationship
was dampened substantially as challenge increased (γ₁₁ = -.07,
p < .01, ES = .09). As in the teacher support model, 11th graders
experienced a boost to the impact of their perceived challenge on
engagement when compared with 12th graders (β₁₄ = .15, p <
.05, ES = .20). Post hoc comparisons of differences in the relationship
between engagement and challenge for other grade levels
did not yield statistically significant results. However, the effect
for gender on the relationship between engagement and challenge
was nonsignificant (β₁₁ = -.04, p = .079), indicating that once
teachers’ obstructive behaviors were taken into account, the relationship
between engagement and challenge was the same for
males and females.

Discussion 

The  multimethod  design  used  in  this  study  allows  for  close  to 
real-time  tracking  of  students’  subjective  experience  as  it  relates  to 
what  teachers  are  doing  in  the  classroom,  and  to  their  own  personal 
characteristics.  The  study  contributes  to  existing  research  on 
teacher  practice  by  providing  a  descriptive  account  of  the  degree  to 
which  high  school  teachers'  behaviors  during  instruction  were 
supportive  or  obstructive  to  student  learning  and  emotional  well¬ 
being,  and  by  examining  the  independent  and  interactive  effects  of 
these  behaviors  on  student  engagement.  While  the  effect  sizes  for 
our  results  are  generally  small,  this  was  not  unanticipated  given 
that  the  measures  of  engagement,  challenge,  support,  and  obstruc¬ 
tion  are  latent  constructs  measured  at  the  momentary  level.  Our 
effect  sizes  are  consistent  with  what  is  often  found  with  this  type 
of  measurement  (see  McCoach,  Gable,  &  Madura,  2013).  The 
value  of  the  methodology  used  here  is  that  it  enables  researchers 
and  practitioners  to  begin  to  understand  how  momentary  changes 
in  one’s  educational  environment  can  result  in  small  changes  in 
student  experience  that,  over  time,  may  result  in  shifts  in  students’ 
motivational  trajectories. 

Appropriately  Challenging  Students  in  Science 

Our  finding  that  students  generally  perceived  low  levels  of 
challenge  in  their  daily  science  activities  is  consistent  with  previ¬ 
ous  research,  and  demonstrates  that,  using  a  variety  of  methods, 
high  school  students  experience  low  levels  of  challenge  at  school 
in general (Shernoff et al., 2003; Yazzie-Mintz, 2010) and in
science  specifically  (Schmidt,  2010;  Strati  &  Schmidt,  2012). 
Science  teachers  in  particular  may  fail  to  appropriately  challenge 
their  students  in  a  well-intentioned  attempt  to  make  science  more 
accessible  to  them.  Based  on  classroom  observations  and  inter¬ 
views  with  teachers  across  a  variety  of  science  subject  areas, 
Shumow  and  Schmidt  (2014)  argued  that  teachers  perceive  that 
their  students  see  science  as  threatening,  and  intentionally  reduce 
the  degree  of  challenge  in  their  planning,  thinking  that  this  will 
help  students  feel  less  intimidated  by  science  and  like  it  better. 
Their  data  indicate,  however,  that  the  unintended  result  of  this 
practice  is  that  students  end  up  perceiving  science  as  boring. 
Budget  cuts  and  increased  pressure  for  science  teachers  to  promote 
mastery  of  discrete  scientific  constructs  to  increase  student  perfor¬ 
mance  on  standardized  tests  may  result  in  less  class  time  devoted 
to  labs  and  inquiry-based  learning  activities  that  could  provide 
both  the  excitement  and  challenge  that  are  inherent  in  true  scien¬ 
tific  activity.  Indeed,  across  the  110  class  periods  that  were  re¬ 
corded  for  this  analysis,  fewer  than  25  involved  anything  resem¬ 
bling  lab  activity  or  inquiry.  In  other  work  based  on  these  same 
data,  we  have  shown  that  the  labs  that  teachers  planned  at  this  level 
tended  to  be  rather  formulaic,  and  did  not  allow  for  much  inquiry 
or  challenge  (Shumow  &  Schmidt,  2014). 

Regardless  of  how  or  why  students  came  to  perceive  challenge 
as low, this pattern of low-challenge activity in science is distressing,
as both theory and research (including the present study) point
to  a  positive  relationship  between  challenge  and  student  engage¬ 
ment  and  motivation  (Csikszentmihalyi,  1990;  Henningsen  & 
Stein,  1997;  Lee  &  Smith,  1999;  Mant,  Wilson,  &  Coates,  2007; 
Shernoff et al., 2003). While our finding that challenge is positively
related to engagement is certainly not unique, when coupled
with  our  descriptive  findings  about  the  relatively  low  levels  of 
perceived  challenge  in  science  in  particular,  our  results  paint  an 
important  picture  for  science  educators.  Contrary  to  perceptions 
that  many  teachers  hold  about  students  being  intimidated  by  sci¬ 
ence,  our  results  suggest  that  in  this  sample  at  least,  students  may 
be  disengaged  from  science  not  because  they  are  intimidated  by 
the  content,  but  rather  because  they  are  not  challenged  by  it. 

The  fact  that  challenge  is  subjective  can  make  this  application  of 
our  research  findings  tricky  in  a  couple  of  different  ways.  First,  we 
know  that  any  given  classroom  activity  may  present  challenge  to 
one student and may not challenge another. Differences in students’
readiness for course content are a perennial reality for most
educators,  and  is  likely  best  managed  through  sound  assessment, 
careful  monitoring,  teacher  scaffolding,  and  differentiated  instruc¬ 
tional  activities  (Tomlinson,  2001,  2014). 

The  second  difficulty  in  dealing  with  the  subjective  nature  of 
challenge  is  less  frequently  discussed  in  the  literature,  but  is 
highlighted  in  the  results  of  the  present  study.  A  unique  contribu¬ 
tion  of  our  study  is  that,  using  close  to  real-time  reports,  we 
demonstrate  how  the  experience  of  perceived  challenge  can  in¬ 
crease  engagement  for  some  students,  but  not  for  others.  We  find, 
for  example,  that  challenge  in  science  appears  to  be  more  moti¬ 
vating  to  males  than  it  is  to  females,  as  there  is  a  stronger 
association  between  challenge  and  engagement  among  males.  Sim¬ 
ilarly,  challenge  seems  to  be  more  motivating  for  11th  graders  as 
compared  with  12th  graders.  There  are  a  variety  of  reasons  why 
these  individual  differences  in  response  to  challenge  may  exist, 
and  it  is  beyond  the  scope  of  this  study  to  articulate  these  defini¬ 
tively.  We  can  speculate,  however,  that  females  may  be  less  likely 
to  respond  to  challenge  in  science  with  engagement  because  of 
widely  held  stereotypes  about  science  as  a  “male”  field  (Debacker 
&  Nelson,  2000;  Farenga  &  Joyce,  1999;  Jones,  Howe,  &  Rua, 
2000), or because of more negative general attitudes toward science
among females (Freeman, 2004; National Center for Education
Statistics, 2000). Both of these beliefs could result in greater
disengagement  among  females,  particularly  in  high-challenge  cir¬ 
cumstances,  where  a  significant  investment  of  time  and  effort  may 
be  required. 

The  pattern  of  disengagement  when  confronted  with  challenge 
may  also  be  explained  by  students’  views  of  science  ability  as 
fixed  or  malleable.  Individuals  who  believe  that  their  ability  is 
fixed  view  effort  as  futile,  believing  that  it  will  not  serve  to 
enhance  their  ability.  Moreover,  individuals  with  fixed  views  of 
ability  are  motivated  to  preserve  the  appearance  of  high  ability,  so 
challenging  activities  are  perceived  as  threats  to  their  self¬ 
presentation.  Thus,  they  tend  to  disengage  in  situations  that  are 
challenging  (Dweck,  2000).  There  is  some  evidence  suggesting 
that,  when  it  comes  to  STEM  disciplines  in  particular,  females  may 
view  their  abilities  as  fixed,  whereas  males  view  them  as  more 
malleable  (Hill,  Corbett,  &  St.  Rose,  2010).  The  patterns  we 
observed  in  our  data  may  suggest  that  adolescent  females  espouse 
a more fixed view of their science ability.

It is noteworthy that the dampening effect of gender on the
positive  relationship  between  challenge  and  engagement  essen¬ 
tially  disappears  in  the  statistical  model  that  also  accounts  for 
teachers’  emotional  and  instrumental  obstruction.  What  this  means 
is  that  once  this  aspect  of  teacher  behavior  is  accounted  for,  males 
and  females  respond  to  challenge  in  ways  that  are  more  similar 
than  different.  A  second  important  contribution  of  this  study  is  this 
preliminary  evidence  that  student  experiences  in  science  are 
shaped  by  the  interaction  of  teacher  behaviors  and  gender.  Future 
research  should  examine  more  deeply  the  effects  of  this  interaction 
on  student  experience. 

Our  finding  that  the  positive  association  between  challenge  and 
engagement is greater for 11th graders than it is for 12th graders is
consistent with the notion that 11th graders may respond to academic
challenge with deeper engagement because of the weight
that is given to 11th grade academic performance in postsecondary
admissions decisions, and that this motivation disappears during a
“senior slump” in 12th grade (Kirst, 2001). While our results may
suggest a senior slump, they do not convincingly support our
suggestion that the relationship between challenge and engagement
will be stronger in higher grades. Post hoc analyses suggest
that  the  relationship  between  challenge  and  engagement  among 
11th  graders  is  not  significantly  stronger  than  that  of  9th  or  10th 
graders:  The  difference  that  emerged  was  between  11th  and  12th 
graders  only.  This  similarity  among  9th,  10th,  and  11th  graders 
identified  in  the  post  hoc  analyses  does  not  support  the  notion 
forwarded  in  our  review  of  literature  that  challenge  may  feel  more 
threatening  for  9th  graders  because  of  stressors  related  to  the 
transition  to  high  school  but  does  support  our  prediction  that 
challenge  does  less  to  promote  engagement  among  12th  grade 
students.  It  is  important,  however,  to  bear  in  mind  that  these  data 
are  not  longitudinal,  so  no  firm  conclusions  regarding  develop¬ 
mental  trends  can  be  made  without  further  study. 

All  of  the  explanations  proposed  above  suggest  the  important 
influence  that  context  has  on  how  students  perceive  their  educational
experiences.  Future  studies  should  continue  to  examine
differences  in  adolescents’  perception  of  challenge  as  motivating 
or  threatening,  as  well  as  the  causes  of  these  differences.  This 
research  should  also  explore  what  role  the  teacher  can  play  in 
assisting  students  who  tend  to  feel  more  threatened  by  challenge. 

Multiple  Dimensions  of  Teacher  Support 
and  Obstruction 

Our  findings  suggest  that  making  distinctions  between  the  instrumental
and  emotional  dimensions  of  teacher  behavior  may  be
helpful  in  understanding  the  complexities  of  secondary  students’
engagement,  while  demonstrating  that  it  is  also  informative  to
explicitly  consider  teacher  obstruction  in  addition  to  teacher  support.
This  represents  a  unique  contribution  to  the  literature  in  that
few  studies  articulate  how  these  different  types  of  teacher  behavior
relate  to  student  experience  in  classrooms.  Theoretically,  distinguishing
instrumental  from  emotional  support  and  obstruction  may
lead  to  more  intellectual  clarity  in  the  field,  as  it  allows  researchers 
to  explore  the  unique  ways  each  construct  is  related  to  student 
engagement  in  intellectually  stimulating  learning  environments. 

Specifically,  it  was  instrumental,  rather  than  emotional,  support
from  teachers  that  was  predictive  of  student  engagement  in  a
consistent  way  for  high  school  students.  When  teachers  scaffold
students’  efforts  in  academic  endeavors  through  the  use  of  structured
questions,  appropriate  materials,  and  the  use  of  developmental
feedback,  students’  engagement  increases.  This  type  of  teacher
support  was  associated  with  student  engagement  across  all  levels 
of  challenge.  On  the  one  hand,  it  is  encouraging  that  instrumental 
support  was  observed  frequently  among  the  teachers  in  our  sample 
and  that  it  can  facilitate  student  engagement  across  a  number  of 
activities  that  present  differing  levels  of  challenge  for  students. 
These  findings  suggest  that  teachers  were  largely  committed  to 
providing  resources  and  assistance  for  students  to  complete  their 
learning  tasks,  and  that  their  efforts  served  to  increase  their  students’
engagement.  However,  the  ubiquity  of  teachers’  instrumental
support  may  generate  some  cause  for  concern  when  considered
in  tandem  with  the  finding  that  students  only  rarely  felt  challenged
in  science.  By  their  own  report,  students  experienced  low  challenge
in  science  class,  yet  their  teachers  provided  them  with  very
regular  scaffolding  and  assistance.  Among  adolescents  who  may 
be  striving  for  autonomy,  teachers’  provision  of  assistance  when 
none  is  needed  not  only  suppresses  independent  problem  solving, 
but  also  may  send  the  message  that  teachers  have  low  expectations 
for  their  students  (Graham,  1990),  which  could,  in  turn,  cause
students  to  doubt  their  own  competence. 

Our  findings  regarding  instrumental  obstruction — the  antithesis 
of  support — are  consistent  with  this  concern.  At  low  levels  of 
academic  challenge,  when  teachers  fail  to  provide  the  necessary 
materials  for  students  to  complete  the  learning  task,  or  ignore 
student  requests  for  help,  students  actually  become  more  engaged. 
However,  the  presence  of  instrumental  obstruction  decreases  the 
positive  impact  academic  challenge  has  on  student  engagement. 
This  suggests  to  us  that  when  students  are  not  being  challenged  by 
their  science  content,  encountering  insufficient  materials,  unclear 
instructions,  or  even  a  neglectful  teacher  may  create  just  a  little  bit 
of  logistical  challenge  that  actually  serves  to  draw  students  into  the 
activity.  However,  as  activities  become  more  challenging,  and 
more  of  students’  mental  effort  is  directed  toward  the  learning 
activity  itself,  such  obstructive  behaviors  may  serve  as  distractions,
leading  to  less  engagement  than  students  might  otherwise
experience  given  the  higher  levels  of  challenge.  This  explanation 
is  speculative  at  this  point,  but  seems  a  plausible  one  for  the 
interactive  effects  of  challenge  and  instrumental  obstruction  on 
engagement. 
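
The moderation pattern described above can be made concrete with a small numerical sketch. This is an illustration only, not the model estimated in the study: it assumes a simple linear prediction with a challenge-by-obstruction interaction term, and every coefficient value below is hypothetical, chosen so that obstruction is associated with higher engagement at low challenge while flattening the challenge slope.

```python
# Illustrative only: all coefficients are hypothetical, not estimates from the study.
B0 = 2.0             # baseline engagement
B_CHALLENGE = 0.50   # slope of challenge when no obstruction is present
B_OBSTRUCT = 0.30    # shift associated with obstruction at zero challenge
B_INTERACT = -0.20   # change in the challenge slope when obstruction is present


def predicted_engagement(challenge: float, obstruction: int) -> float:
    """Linear prediction with a challenge-by-obstruction interaction."""
    return (B0
            + B_CHALLENGE * challenge
            + B_OBSTRUCT * obstruction
            + B_INTERACT * challenge * obstruction)


def obstruction_gap(challenge: float) -> float:
    """Predicted engagement with obstruction minus without, at a given challenge."""
    return predicted_engagement(challenge, 1) - predicted_engagement(challenge, 0)


if __name__ == "__main__":
    # The gap shrinks as challenge rises and eventually changes sign.
    for level in (0, 1, 2, 3):
        print(level, round(obstruction_gap(level), 2))
```

Under these assumed values the gap equals 0.30 - 0.20 x challenge, so obstruction is associated with higher predicted engagement only while challenge is low, which is the shape of the interaction discussed in the text.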

Teacher  instructional  behaviors  that  are  directed  at  students’ 
emotional  well-being  tell  a  very  different  story  than  teachers’ 
instrumental  behaviors.  Whereas  teachers’  instrumental  support 
was  consistently  and  significantly  predictive  of  engagement,  the 
impact  of  teachers’  emotional  support  operated  differently.  The 
encouragement,  acceptance  and  caring  behaviors  we  observed 
among  teachers  were  not  appreciably  related  to  student  engagement
at  any  level  of  challenge.  This  null  finding  is  noteworthy  as
it  contradicts  a  large  body  of  research  that  consistently  finds
positive  associations  between  teacher  emotional  support  and  student
engagement  in  younger  samples  of  students  (Hughes  &
Kwok,  2007;  Marks,  2000;  Murdock,  1999;  Murray,  2009;  Patrick, 
Ryan,  &  Kaplan,  2007;  Ryan  &  Patrick,  2001).  There  are  several 
possible  explanations  for  these  findings.  First,  the  teachers  in  this 
sample  provided  emotional  support  less  frequently  than  they  provided
instrumental  support,  so  it  is  possible  that  students  just  did
not  have  adequate  exposure  to  supportive  comments  for  them  to  be 
impactful.  Second,  during  high  school,  overt  emotional  support 
from  teachers  may  be  less  salient  than  it  is  at  younger  grade  levels
(Hamre  &  Pianta,  2001;  Pianta  et  al.,  2005;  Pianta,  Nimetz,  &
Bennett,  1997;  Saft  &  Pianta,  2001)  for  engaging  youth  in  their 
studies.  Adolescents  are  beginning  to  demonstrate  more  autonomy 
in  their  academic  pursuits  and  thus  may  not  need,  or  may  not  be  as
receptive  to,  bids  of  emotional  support  by  teachers.  They  may  rely
more  heavily  on  friends  for  emotional  support,  and  thus  may  not 
draw  as  many  academic  benefits  from  emotionally  supportive 
teachers.  Third,  most  high  schools  are,  by  design,  more  bureaucratic
and  impersonal  than  middle  or  elementary  schools,  affording
few  opportunities  for  emotional  connection  between  students  and
teachers  (Anderman  &  Anderman,  2010;  Eccles  et  al.,  1993).  As  a
result,  teachers’  words  of  encouragement  and  appreciation  may  not
seem  particularly  genuine  or  impactful  to  students  because  they
lack  strong  interpersonal  connection  with  their  teachers.  A  fourth
explanation  is  that  there  might  be  additional  aspects  of  emotional 
support  that  exert  influence  on  student  engagement,  but  we  were 
not  able  to  capture  these  dimensions  in  our  observational  coding. 
A  final  possibility  to  consider  is  that  teachers’  emotionally  supportive
behaviors  as  they  are  coded  here  may,  in  fact,  be  incredibly
important  for  maintaining  adolescents’  emotional  well-being,  but
that  this  importance  may  not  necessarily  translate  into  increased
academic  engagement  in  science.  Engagement  may  simply  not  be
the  correct  outcome  to  measure  the  importance  of  teachers’  emotional
support  for  high  school  students.  Given  the  multiple  possible
explanations  for  our  null  findings,  the  role  of  emotional  support  in 
shaping  students’  classroom  experiences  at  the  high  school  level 
merits  further  study. 

While  teachers’  attempts  to  be  emotionally  supportive  were  not 
generally  helpful  for  promoting  student  engagement,  teacher  behaviors
that  are  overtly  obstructive  emotionally  appeared  to  be
detrimental  to  student  engagement.  During  those  episodes  when
we  observed  teachers  to  be  emotionally  obstructive  by  teasing  or
ridiculing  students,  students’  engagement  declined  significantly
across  all  levels  of  challenge.  This  finding  is  of  critical
importance  to  high  school  teachers  who  often  use  sarcasm  and 
good-natured  teasing  as  a  means  of  connecting  with  their  students 
(Shumow  &  Schmidt,  2014).  Our  findings  suggest  that  these 
attempts  at  connection  may  do  more  harm  than  good  for  engaging 
students  academically,  at  least  in  the  moment  when  the  sarcasm 
and  ridicule  are  happening.  This  finding  is  particularly  compelling 
because  many  of  the  teachers’  sarcastic  comments  were  directed 
toward  individual  students,  yet  in  those  moments  when  these 
comments  were  made,  the  engagement  for  all  students  in  the  class 
generally  declined.  This  finding  highlights  how  a  teacher’s  interaction
with  a  single  student  has  the  potential  to  impact  an  entire
classroom  whether  intentional  or  not.  What  is  perhaps  most  distressing
is  that  teachers  were  observed  to  be  emotionally  obstructive
in  approximately  one-third  of  the  instructional  episodes  we
observed.  Research  on  the  frequency  of  obstructive  teacher  behaviors
is  scant.  However,  the  few  studies  we  located  that  do  examine
nonsupportive  teacher  behaviors  corroborate  our  findings  regarding
the  frequency  of  teacher  obstruction,  with  obstructive  comments
accounting  for  as  much  as  33–53%  of  teacher  discourse
(Meyer  &  Turner,  2002;  Walsh,  2002).

The  significant  random  effects  of  both  emotional  support  and 
emotional  obstruction  suggest  that  the  emotional  aspect  of  teacher 
behavior  is  an  area  ripe  for  future  study.  Results  suggest  that  the 
relationship  of  both  emotional  support  and  emotional  obstruction
with  student  engagement  was  not  consistent  across  students,  meaning
these  relationships  might  be  negative  for  some  students  but  not
for  others.  Future  research  should  explore  whether  there  are  personal
characteristics  that  can  predictably  explain  the  variation  in
this  relationship.  Results  of  this  type  of  study  might  suggest  ways 
that  teachers  can  most  effectively  support  their  students. 
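
What a significant random effect implies can be sketched in a few lines. The example below is illustrative only and assumes the usual random-slope decomposition, in which each student's slope is the average (fixed) slope plus a student-specific deviation. The student labels and all numeric values are hypothetical and are not estimates from this study.

```python
# Illustrative only: all numbers are hypothetical, not estimates from the study.
FIXED_SLOPE = -0.40  # average association of emotional obstruction with engagement

# Hypothetical student-specific deviations from the average slope (random effects).
STUDENT_DEVIATIONS = {"A": -0.30, "B": 0.00, "C": 0.25, "D": 0.55}


def student_slope(student: str) -> float:
    """Student-specific slope = fixed (average) slope + that student's deviation."""
    return FIXED_SLOPE + STUDENT_DEVIATIONS[student]


def students_with_negative_slope() -> list[str]:
    """Students for whom obstruction is associated with lower engagement."""
    return [s for s in STUDENT_DEVIATIONS if student_slope(s) < 0]


if __name__ == "__main__":
    for name in STUDENT_DEVIATIONS:
        print(name, round(student_slope(name), 2))
```

In this sketch the average slope is negative, yet one hypothetical student has a slightly positive slope. That is the kind of between-student variation a significant random effect indicates, and it is what student characteristics might help explain in future work.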

Limitations 

While  the  student  sample  was  relatively  diverse  in  terms  of  race, 
ethnicity,  and  socioeconomic  status,  the  fact  that  the  sample  was
relatively  small  and  was  drawn  from  a  single  high  school  may
limit  the  generalizability  of  our  results.  It  is  worth  noting,  however,
that  the  outcome  of  interest  (momentary  engagement)  and  major 
predictor  variables  (momentary  challenge,  teacher  support,  and 
teacher  obstruction)  were  measured  at  multiple  time  points  (n  =
3,803)  that  were  randomly  selected  from  across  the  gamut  of
science  activities:  Thus,  the  sample  of  science  experiences  that  was 
used  to  construct  the  statistical  models  is  quite  heterogeneous  and 
likely  representative  of  the  science  experience  of  this  population. 

A  second  limitation  is  that  the  study  design  precludes  the 
conclusion  of  causal  relationships  between  teacher  instructional 
behaviors  and  student  engagement.  While  the  teacher  instructional 
behaviors  we  observed  and  coded  temporally  preceded  students’ 
reports  of  their  engagement,  it  is  not  possible  to  draw  definitive 
causal  conclusions  about  their  relations  to  student  engagement 
levels.  Related  to  this  issue  of  temporality  and  causality,  some 
researchers  have  suggested  that  the  nature  of  the  relationship 
between  teacher  support  and  student  engagement  is  bidirectional 
(Connell  &  Wellborn,  1991;  Furrer  &  Skinner,  2009;  Skinner  & 
Belmont,  1993).  In  our  analytic  models  we  treated  teacher  support 
(both  emotional  and  instrumental)  as  predictors  of  engagement, 
and  did  not  attempt  to  model  a  bidirectional  relationship.  Future 
studies  can  more  closely  investigate  whether  and  how  student 
engagement  may  elicit  higher  levels  of  teacher  support. 

A  third  limitation  is  that  our  measures  of  teacher  support  and 
obstruction,  while  reliable  and  based  on  fairly  objective  criteria, 
were  rather  broad  and  did  not  permit  us  to  explore  teacher  intentionality
or  gradations  in  the  degree  to  which  teachers  were  supportive
or  obstructive.  Thus,  only  the  broadest  brushstrokes  were
measured  (presence  vs.  absence).  Future  research  should  meaningfully
and  reliably  measure  gradations  in  teachers’  provision  of
support  or  obstruction. 

Finally,  because  of  their  relative  infrequency,  our  coding  of 
teachers’  instrumental  and  emotional  obstruction  combined  two 
potentially  distinct  types  of  obstructive  behavior:  purposeful  undermining
and  neglect.  These  can  be  intentional  or  unintentional.
It  is,  therefore,  possible  that  these  two  types  of  obstructive  behavior
may  differentially  moderate  the  relationship  between  challenge
and  engagement.  Although  we  did  not  directly  measure  teacher 
intentionality  in  our  coding  scheme,  these  instructional  behaviors 
were  nonetheless  perceived  as  not  motivating  by  students  and  were 
associated  with  lower  engagement.  This  area  is  ripe  for  future
research.

Conclusion 

Student  engagement  in  academics  is  multidimensional  and  fluctuates
from  moment  to  moment.  The  unique  methodology  used  in
this  study  enabled  us  to  link  fluctuations  in  students’  perception  of
challenge  and  teachers’  provision  of  support  and  obstruction  to 
variation  in  students’  reported  engagement  in  science  classrooms. 
The  analytic  approach,  which  accounted  for  multiple  nesting  structures
in  the  data,  identified  ways  in  which  students’  personal
characteristics  and  teachers’  instructional  behaviors  moderate  the 
linkages  between  students’  momentary  subjective  perceptions  and 
their  engagement  in  science.  Our  results  suggest  that  this  particular 
methodology  may  be  fruitful  for  further  unpacking  the  nature  of 
challenge,  support,  and  obstruction  in  secondary  classrooms,  and 
for  enhancing  our  understanding  of  the  proximal  classroom  factors 
that  are  associated  with  student  engagement  in  science. 
Correction  to  Strati,  Schmidt,  and  Maier  (2016)

In  the  article  “Perceived  Challenge,  Teacher  Support,  and  Teacher  Obstruction  as  Predictors  of 
Student  Engagement”  by  Anna  D.  Strati,  Jennifer  A.  Schmidt,  and  Kimberly  S.  Maier  (Journal  of
Educational  Psychology,  Advanced  online  publication.  March  3,  2016.  http://dx.doi.org/10.1037/edu0000108),
the  sixth  sentence  of  the  Relations  of  Challenge  and  Support  subsection  of  the
Results  section  should  read  “Results  from  a  baseline  null  cross-classified  model  indicated  that  53%
of  the  variance  in  engagement  occurred  between  cell  (cross-classification  of  student  and  instructional
episode),  about  37%  occurred  between-students,  and  about  10%  was  attributed  to  instructional
episode.”

http://dx.doi.org/10.1037/edu0000136